I've been on stack for a few hours exploring examples of other presto unnest/map/cast solutions but I can't seem to find one that works for my data.
Here's a sample of my data:
with test_data (id, messy_json) AS (
VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
('TEST_B', JSON '{"problem":[],"version":4}'),
('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
),
The JSON column is semi-unstructured and can hold multiple lvls / doesn't always have every key:value pair as other rows.
I was trying solutions like:
with test_data AS (
select id,
messy_json
from larger_tbl),
select
id as id,
json_extract_scalar(test_data, '$.version') as lvl1_version
json_extract_scalar(lvl2, '$.problem') as lvl2_id
from test
LEFT JOIN UNNEST(CAST(json_parse(messy_json) AS array(json))) AS x(lvl1) ON TRUE
LEFT JOIN UNNEST(CAST(json_extract(lvl1, '$.problem') AS array(json))) AS y(lvl2) ON TRUE
This gets me cast errors etc. I've tried some variations with
unnest(cast(json_col as map(varchar, map(varchar,varchar)) options too.
My goal is to explode the entire dataset with the retained ID and all keys/multi-lvl keys retained in a long dataset. I appreciate any input/guidance, thanks!
To be honest your current query does not make much sense based on provided sample data and stated goals.
UNNEST
is used with cross join
rather then left join
(I use succinct syntax skipping the keyword completely)json_parse(messy_json)
in your test data since it is already json (though I assume in the real data it just a varchar field then it should be kept)problem
in json_extract_scalar(lvl2, '$.problem') as lvl2_id
should be changed to id
since problem
array was already unnested-- sample data
with test_data (id, messy_json) AS (
VALUES ('TEST_A', JSON '{"issue":[],"problem":[{"category":"math","id":2,"name":"subtraction"},{"category":"math","id":3,"name":"division"},{"category":"english","id":25,"name":"verbs"},{"category":"english","id":27,"name":"grammar"},{"category":"language","id":1,"name":"grammar"}],"version":4}'),
('TEST_B', JSON '{"problem":[],"version":4}'),
('TEST_C', JSON '{"version": 4, "problem": [], "issue": [null, null, null, null, null, null, null, null, null, null, null]}')
)
-- query
select id
, json_extract_scalar(messy_json, '$.version') as lvl1_version
, json_extract_scalar(lvl2, '$.id') as lvl2_id
from test_data
, UNNEST(CAST(json_extract(messy_json, '$.problem')
AS array(json)), array[1]) AS y(lvl2, ignored);
Output:
id | lvl1_version | lvl2_id |
---|---|---|
TEST_A | 4 | 2 |
TEST_A | 4 | 3 |
TEST_A | 4 | 25 |
TEST_A | 4 | 27 |
TEST_A | 4 | 1 |
TEST_B | 4 | NULL |
TEST_C | 4 | NULL |