Search code examples
sqlregexprestotrino

Regular expression to stop at first match of one or the other character


I am parsing a json in presto and applying regular expression on top of it.

{"value":"{\"123\":{\"type\":\"TAGGING\",\"role\":\"RESULT\",\"payload\":[{\"values\":{\"qbc\":\"not-abc\",\"def\":\"yes\",\"ghi\":\"gh\"}}]}}"}

Regexp used :
REGEXP_EXTRACT(JSON_EXTRACT_SCALAR(
        Opfields,
        '$.value'
    ),
    '.*"abc":(.*?),',
    1
) abc

Now this works for abc as there is a comma after abc value but does not work for ghi. For ghi I know in this example I could use } in this case. But the problem is that these scaler values like abc , ghi can occur in different order in which case it might end up with a comma or }.

I am looking for an expression which would work in every case no matter what the order is. Essentially I want the values for abc, def and ghi no matter in which order they appear in the json.

Can someone please advice a good way to do this.

I have mentioned the regexp I have tried


Solution

  • You don't need to use regex here, based on the shown data you seem to have just a double encoded JSON here (i.e. value contains a JSON string which holds an encoded JSON) so you can parse it again and process it as JSON. With Presto (due to it's limited JSON path support) it will be a bit cumbersome involving some casts to maps and arrays:

    -- sample data
    with dataset(opfields) as(
        values ('{"value":"{\"123\":{\"type\":\"TAGGING\",\"role\":\"RESULT\",\"payload\":[{\"values\":{\"abc\":\"not-abc\",\"def\":\"yes\",\"ghi\":\"gh\"}}]}}"}')
    )
    
    -- query
    select JSON_EXTRACT(js1, '$.values.abc') abc,
           JSON_EXTRACT(js1, '$.values.ghi') ghi
    from dataset,
        unnest (cast(JSON_PARSE(JSON_EXTRACT_SCALAR(opfields, '$.value')) as map(varchar, json))) as t(k, js),
        unnest (cast(JSON_EXTRACT(js, '$.payload') as array(json))) as t1(js1);
    

    Output:

       Abc   | ghi
    -----------------+-----------
     "not-abc" | "gh"
    

    If you actually are using Trino you can use json_query:

    select json_query(JSON_EXTRACT_SCALAR(opfields, '$.value'), 'strict $.*.payload[*].values.abc' WITHOUT ARRAY WRAPPER) sensitivity,
           json_query(JSON_EXTRACT_SCALAR(opfields, '$.value'), 'strict $.*.payload[*].values.ghi' WITHOUT ARRAY WRAPPER) certainty
    from dataset;