Search code examples
jsonregexhiveimpala

Escaping RegEx in JSON for Hive


I have a JSON file in which I put my SQL query and it will later on be run automatically on Hive.

Structure of the JSON is as-follow:

{
    "name": "query1",
    "query": "select regexp_extract(column, '(.*)\\s\\|', 1) as column_one from data"
}

The idea is to extract everything until the first space + vertical bar.

For a given example, when I try in Hive, it works as expected:

select regexp_extract('First part | Second Part', '(.*)\\s\\|', 1) as column_one;

First part

As you can see you already need one more backslash for \s and \| otherwise is does not work in Hive. However when run automatically using the JSON file, I get the following:

Fir

Then I understood you need another backslash to escape it in JSON so I went with the following:

{
    "name": "query1",
    "query": "select regexp_extract(column, '(.*)\\\s\\\|', 1) as column_one from data"
}

But still, it gives me the Fir instead of First part.


Solution

  • You can use

    {
        "name": "query1",
        "query": "select regexp_extract(column, '([^|]*?)[[:space:]]*[|]', 1) as column_one from data"
    }
    

    Note that

    • ([^|]*?) - Captures any zero or more chars other than | as few as possible
    • [[:space:]]* - zero or more whitespace chars
    • [|] - a literal | char (inside a character class, [...], the | char is treated as a literal pipe symbol).