Search code examples
sqlprestounnest

Presto unnest/map complex json


I've been on stack for a few hours exploring examples of other presto unnest/map/cast solutions but I can't seem to find one that works for my data.

Here's a sample of my data:

with test_data (id, messy_json) AS (
    VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
           ('TEST_B', JSON '{"problem":[],"version":4}'),
           ('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
),

The JSON column is semi-unstructured and can hold multiple lvls / doesn't always have every key:value pair as other rows.

I was trying solutions like:

with test_data AS (
select id,
       messy_json        
from larger_tbl),


select 
    id as id,
    json_extract_scalar(test_data, '$.version') as lvl1_version
    json_extract_scalar(lvl2, '$.problem') as lvl2_id
    from test
LEFT JOIN UNNEST(CAST(json_parse(messy_json) AS array(json))) AS x(lvl1) ON TRUE
LEFT JOIN UNNEST(CAST(json_extract(lvl1, '$.problem') AS array(json))) AS y(lvl2) ON TRUE

This gets me cast errors etc. I've tried some variations with

unnest(cast(json_col as map(varchar, map(varchar,varchar)) options too.

My goal is to explode the entire dataset with the retained ID and all keys/multi-lvl keys retained in a long dataset. I appreciate any input/guidance, thanks!


Solution

  • To be honest your current query does not make much sense based on provided sample data and stated goals.

    • UNNEST is used with cross join rather then left join (I use succinct syntax skipping the keyword completely)
    • No need to json_parse(messy_json) in your test data since it is already json (though I assume in the real data it just a varchar field then it should be kept)
    • Root json is a json object, so it can't be casted to array (and there is no need to).
    • problem in json_extract_scalar(lvl2, '$.problem') as lvl2_id should be changed to id since problem array was already unnested
    • To handle cases where the problem array is empty you can use approach with unnesting multiple arrays with a dummy one with 1 element
    -- sample data
    with test_data (id, messy_json) AS (
        VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
               ('TEST_B', JSON '{"problem":[],"version":4}'),
               ('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
    )
    
    -- query
    select id
         , json_extract_scalar(messy_json, '$.version') as lvl1_version
         , json_extract_scalar(lvl2, '$.id')            as lvl2_id
    from test_data
       , UNNEST(CAST(json_extract(messy_json, '$.problem') 
           AS array(json)), array[1]) AS y(lvl2, ignored);
    

    Output:

    id lvl1_version lvl2_id
    TEST_A 4 2
    TEST_A 4 3
    TEST_A 4 25
    TEST_A 4 27
    TEST_A 4 1
    TEST_B 4 NULL
    TEST_C 4 NULL