I need to extract the numbers form a string in a column using regex_extract. I'm using Impala over a external Table.
I have already checked the regular expression, and to test it, I also used regexp_like and regexp_replace. Both of them worked just perfect.
Here is the query:
select
sucursal,
regexp_like(sucursal,'^[0-9]{1,3}') as match,
regexp_extract(sucursal,'^[0-9]{1,3}',1) as CodSucusal,
regexp_replace(sucursal,'^[0-9]{1,3}','lala') as RepCodSucusal
from jdv.stg_devoluciones limit 5;
here is the results:
+-------------------+-------+------------+--------------------+
| sucursal | match | codsucusal | repcodsucusal |
+-------------------+-------+------------+--------------------+
| 124 NAVOJOA | true | | lala NAVOJOA |
| 73 BOCA DEL RIO | true | | lala BOCA DEL RIO |
| 964 JIUTEPEC | true | | lala JIUTEPEC |
| 456 TEQUISQUIAPAN | true | | lala TEQUISQUIAPAN |
| 212 LANDIN | true | | lala LANDIN |
+-------------------+-------+------------+--------------------+
codsucursal should be the sucursal number but regexp_extract is returning null instead
Expected results:
+-------------------+-------+------------+--------------------+
| sucursal | match | codsucusal | repcodsucusal |
+-------------------+-------+------------+--------------------+
| 124 NAVOJOA | true | 124 | lala NAVOJOA |
| 73 BOCA DEL RIO | true | 73 | lala BOCA DEL RIO |
| 964 JIUTEPEC | true | 964 | lala JIUTEPEC |
| 456 TEQUISQUIAPAN | true | 456 | lala TEQUISQUIAPAN |
| 212 LANDIN | true | 212 | lala LANDIN |
+-------------------+-------+------------+--------------------+
What am I doing wrong?
Impala regexp_extract
function takes the index of a capturing group or 0
for the whole match as the third argument.
regexp_extract(string subject, string pattern, int index)
Purpose: Returns the specified () group from a string based on a regular expression pattern. Group 0 refers to the entire extracted string, while group 1, 2, and so on refers to the first, second, and so on(...)
portion.
As your regex - ^[0-9]{1,3}
- has no capturing groups defined in it you should use 0
as the third argument to refer to the whole match value:
regexp_extract(sucursal,'^[0-9]{1,3}', 0) as CodSucusal