Search code examples
regexhadoopetlimpala

Impala regexp_like query returning null when regexp_like and regexp_extract works fine


I need to extract the numbers form a string in a column using regex_extract. I'm using Impala over a external Table.

I have already checked the regular expression, and to test it, I also used regexp_like and regexp_replace. Both of them worked just perfect.

Here is the query:

select 
    sucursal,
    regexp_like(sucursal,'^[0-9]{1,3}') as match,
    regexp_extract(sucursal,'^[0-9]{1,3}',1) as CodSucusal,
    regexp_replace(sucursal,'^[0-9]{1,3}','lala') as RepCodSucusal
from jdv.stg_devoluciones limit 5;

here is the results:

+-------------------+-------+------------+--------------------+
| sucursal          | match | codsucusal | repcodsucusal      |
+-------------------+-------+------------+--------------------+
| 124 NAVOJOA       | true  |            | lala NAVOJOA       |
| 73 BOCA DEL RIO   | true  |            | lala BOCA DEL RIO  |
| 964 JIUTEPEC      | true  |            | lala JIUTEPEC      |
| 456 TEQUISQUIAPAN | true  |            | lala TEQUISQUIAPAN |
| 212 LANDIN        | true  |            | lala LANDIN        |
+-------------------+-------+------------+--------------------+

codsucursal should be the sucursal number but regexp_extract is returning null instead

Expected results:

+-------------------+-------+------------+--------------------+
| sucursal          | match | codsucusal | repcodsucusal      |
+-------------------+-------+------------+--------------------+
| 124 NAVOJOA       | true  |   124      | lala NAVOJOA       |
| 73 BOCA DEL RIO   | true  |   73       | lala BOCA DEL RIO  |
| 964 JIUTEPEC      | true  |   964      | lala JIUTEPEC      |
| 456 TEQUISQUIAPAN | true  |   456      | lala TEQUISQUIAPAN |
| 212 LANDIN        | true  |   212      | lala LANDIN        |
+-------------------+-------+------------+--------------------+

What am I doing wrong?


Solution

  • Impala regexp_extract function takes the index of a capturing group or 0 for the whole match as the third argument.

    regexp_extract(string subject, string pattern, int index)
    Purpose: Returns the specified () group from a string based on a regular expression pattern. Group 0 refers to the entire extracted string, while group 1, 2, and so on refers to the first, second, and so on (...) portion.

    As your regex - ^[0-9]{1,3} - has no capturing groups defined in it you should use 0 as the third argument to refer to the whole match value:

    regexp_extract(sucursal,'^[0-9]{1,3}', 0) as CodSucusal