Tags: regex, hive, expression, cloudera

input.regex in Hive


An earlier question was asked about the following dataset:

03-24-2014  fm506   TOTAL-PROCESS   OK;HARD;1;PROCS OK: 717 processes
03-24-2014  fm504   CHECK-LOAD  OK;SOFT;2;OK - load average: 54.61, 56.95

The input regex provided in that thread does not work at all, so I created two input regexes and tested the first one at http://www.regexplanet.com/advanced/java/index.html. The groups are captured perfectly there, but when I try it in Hive, only NULL values are loaded.

The first input regex I provided is:

([^ ]*)\t+([^ ]*)\t+([^ ]*)\t+([^ ]*)

My second input regex is

^(\\S+)\\t+(\\S+)\\t+(\\S+)\\t+(\\S+)$

I thought it would work, but it also loads only NULL values.

Could you please let me know what's wrong with these two input regexes?


Solution

  • Your first pattern does not match the entire string: its field-matching parts are [^ ]*, that is, zero or more chars other than a space, so the last field, which contains spaces, cannot be matched.

    The second regex has the same problem: it uses \S+, which matches one or more chars other than whitespace, so its last group cannot match the last field either.

    You may use

    ^(\S+)\t+(\S+)\t+(\S+)\t+(.+)
    ^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)
    

    The [^\t]* matches any field in a tab-delimited text since it matches zero or more chars other than a tab.
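
To check the explanation above outside Hive, here is a minimal sketch in Python. It rests on a few assumptions: the fields in the sample rows are separated by single tabs, the pattern has to match the whole line (which is what the answer implies about Hive's behaviour), and these particular constructs behave the same in Python's re module as in Java's regex engine. The helper name parse is hypothetical. The patterns are written with single backslashes, as in the answer; a Hive table definition may need them doubled, as in the second regex of the question.

```python
import re

# Sample rows from the question.
# Assumption: the fields are separated by single tab characters.
ROWS = [
    "03-24-2014\tfm506\tTOTAL-PROCESS\tOK;HARD;1;PROCS OK: 717 processes",
    "03-24-2014\tfm504\tCHECK-LOAD\tOK;SOFT;2;OK - load average: 54.61, 56.95",
]

# The two patterns from the question.
ORIGINAL = [
    r"([^ ]*)\t+([^ ]*)\t+([^ ]*)\t+([^ ]*)",
    r"^(\S+)\t+(\S+)\t+(\S+)\t+(\S+)$",
]

# The two patterns suggested in the answer.
SUGGESTED = [
    r"^(\S+)\t+(\S+)\t+(\S+)\t+(.+)",
    r"^([^\t]*)\t+([^\t]*)\t+([^\t]*)\t+(.*)",
]


def parse(pattern, line):
    """Return the captured fields if the pattern matches the whole line, else None."""
    match = re.fullmatch(pattern, line)
    return list(match.groups()) if match else None


if __name__ == "__main__":
    for pattern in ORIGINAL + SUGGESTED:
        for row in ROWS:
            print(pattern, "->", parse(pattern, row))
```

With these inputs, both original patterns return None for every row, because no group can consume the spaces in the last field. Both suggested patterns return four fields, with the last one holding everything after the third run of tabs.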