Search code examples
regexhiveimpala

Regex - extract last term between _ and before . from path


This is the regex that I'm currently testing

[\w\. ]+(?=[\.])

My ultimate goal is to include a regex expression to extract using regexp_extract in Impala/Hive query.

regexp_extract(col, '[\w\. ]+(?=[\.])', 1)

This doesn't work in Impala however.

Examples of path to extract from:

D:\mypath\Temp\abs\device\Program1.lua
D:\mypath\Temp\abs\device\SE1_Test-program.lua
D:\mypath\Temp\abs\device\Test_program.lua
D:\mypath\Temp\abs\device\Device_Test_Case-general.lua

The regex I've tested extracts the term I'm looking for but it's not good enough, for the second and third, fourth cases I would need to extract only the part after the last underscore.

enter image description here

My expections are:

Program1
Test-program
program
Case-general

Any suggestions? I'm also open to using something other than regexp_extract.


Solution

  • Note that Impala regex does not support lookarounds, and thus you need a capturing group to get a submatch out of the overall match. Also, if you use escaping \ in the pattern, make sure it is doubled.

    You can use

    regexp_extract(col, '([^-_\\\\]+)\\.\\w+$', 1)
    

    See the regex demo.

    The regex means

    • ([^-_\\]+) - Group 1: one or more chars other than -, _ and \
    • \. - a dot
    • \w+ - one or more word chars
    • $ - end of string.