Search code examples
sqlscalaapache-sparkpresto

Using JSON_EXTRACT or JSON_EXTRACT_SCALAR in Presto SQL or Scala


id value
123 {78kfcX={"Sent": 77, "Respond": 31, "NoResponse": 31}, 97Facz={"Sent": 45, "Respond": 31, "NoResponse": 31}}
333 {5mdzrZ={"Sent": 1, "Respond": 1, "NoResponset": 1}}

Given the table above, I am trying to extract the "Sent" value... In cases where there are multiple sent values then I want to take the max.

I have tried using json_extract, json_extract_scalar, json_parse and multiple other functions in SQL but nothing seems to work. I keep getting NULL values in all my attempts.

The expected outcome given the example above should be:

id Sent
123 77
333 1

I think one way to approach this is by first doing a CROSS JOIN UNNEST to split the value column by 78kfcX, 97Facz, 5mdzrZ ids. And then extracting the sent value from there and taking the max, grouped by the id column.

Attempted code:

SELECT 
id, 
json_extract_scalar(value, '$.Sent') AS 'sent'
FROM table

JSON_PARSE(value) returns the following error:

Cannot convert value to JSON: '{78kfcX={"Sent": 77, "Respond": 31, "NoResponse": 31}, 97Facz={"Sent": 45, "Respond": 31, "NoResponse": 31}}'


Solution

  • answer for original question The input string itself is not in JSON format. Therefore, we need to first extract {"Sent": 77, "Respond": 31, "NoResponse": 31} out by regular expression, and then apply JSON_EXTRACT_SCALAR() function on it.

    The following regex pattern can extract the attribute maps out from the input (Regex101 verification):

    \{".*?\}
    

    With alll above tricks, in Presto, you can get the result you want in the following steps:

    Step 1. Use regexp_extract_all() function to extract all attribute maps out of the input string into an array.

    Step 2. Apply JSON_extract_scalar() to each element of the array and extract Sent part out

    Step 3. Use array_max() function to get the max value you want.

    Here is the query in Presto:

    SELECT
        id,
        APPAY_MAX(
             TRANSFORM(
                 REGEXP_EXTRACT_ALL(value, '\{".*?\}'),
                 v -> CAST(JSON_EXTRACT_SCALAR(v, '$.Sent') AS INT)
             )
       ) AS sent
    FROM
        input_table
    
    id Sent
    123 77
    333 1

    answer for the follow-up If we want to calculation sum instead of max from the array, we can use reduce() function and lamda expression to do that.

    Here is the query:

    SELECT
        id,
        REDUCE(
             TRANSFORM(
                 REGEXP_EXTRACT_ALL(value, '\{".*?\}'),
                 v -> CAST(JSON_EXTRACT_SCALAR(v, '$.Sent') AS INT)
             ), 0, (s,x) -> s + x, s -> s
       ) AS sent
    FROM
        input_table
    
    id Sent
    123 122
    333 1