Search code examples
sqlregexselecthive

hive regex find pattern and return it in select statement


I would like to extract 3 words before the selay dervice but the query returns an empty column :(

with a as (
        select * from tablename1 b 
        where lower(ptranscript) rlike 'selay dervice'
        )
        select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\s+){3}selay dervice',0)  from a

##########update 1

as pointed by Raid earlier, in Hive we cannot use \s and have to use \\s. I updated the above regex accordingly and it works

with a as (
            select * from tablename1 b 
            where lower(ptranscript) rlike 'selay dervice'
            )
            select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\\s+){3}selay dervice',0)  from a

Solution

  • Try below:

    with a as (
            select * from tablename1 b 
            where lower(ptranscript) rlike 'selay dervice'
            )
            select *,regexp_extract(lower(a.ptranscript),'(?:[a-zA-Z0-9]+ ){3}selay dervice',0)  from a
    
    

    Note that if there are less than 3 words before selay dervice you will get empty results.

    I tested similar query in latest apache hive and got something like below:

    +----------------------------------+-----------------------------+
    |               key                |          regex_ext          |
    +----------------------------------+-----------------------------+
    | rlk1 selay dervice               |                             |
    | selay dervice k4                 |                             |
    | k5 selay dervice ew              |                             |
    | thre word b4 selay dervice       | thre word b4 selay dervice  |
    | four word be four selay dervice  | word be four selay dervice  |
    +----------------------------------+-----------------------------+
    
    

    Edit 1: Result does not vary with or without ? All 3 versions below gives same result.

    1. '(?:[a-zA-Z0-9]+ )'
    2. '([a-zA-Z0-9]+ )'
    3. '([a-zA-Z0-9]+\\s)'

    As per docs \s matches any white space not just spacebar