Search code examples
sqljsonamazon-athenaprestounnest

Extracting JSON objects in Athena / Presto for dynamic keys


I have two string formatted columns in my table. One of the columns has json objects in it. The keys in the JSON objects are not fixed - so the problem is I cant use standard json extract functions. Here is how the table looks

timestamp upstream_tables
2023-02-02T17:34:55.946Z {"ETL_table_1":true,"ETL_table_2":true}
2023-02-02T13:30:11.882Z {"ETL_table_3":true}
2023-02-02T20:12:22.116Z {"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}

I want to flatten this table to something like below

timestamp upstream_tablename was_completed
2023-02-02T17:34:55.946Z ETL_table_1 TRUE
2023-02-02T17:34:55.946Z ETL_table_2 TRUE
2023-02-02T13:30:11.882Z ETL_table_3 TRUE
2023-02-02T20:12:22.116Z ETL_table_4 TRUE
2023-02-02T20:12:22.116Z ETL_table_5 TRUE
2023-02-02T20:12:22.116Z ETL_table_6 FALSE

Can anyone please help? Have spent a lot of time using map_values and map_keys but couldnt get it right.

The only closest thing I could come up with was this

select
   timestamp,
    t.*
FROM mytable
   CROSS JOIN UNNEST(map_entries(CAST(json_extract(upstream_tables, '$') AS MAP(VARCHAR, VARCHAR)))) AS t

Solution

  • @martin-traverso's answer can be used with Athena engine v. 3 which is based on Trino, for both v.2 and v.3 the main trick of casting to map you have discovered, I would switch from using json_extract to json_parse (to transform from string to json), skip map_entries (Presto/Trino can unnest maps to key-value pairs, optionally use MAP(VARCHAR, JSON) as target type) and specify column names for unnest result, Presto/Trino can unnest maps to key-value pairs:

    WITH data(ts, value) AS (
        VALUES
            (from_iso8601_timestamp('2023-02-02T17:34:55.946Z'), VARCHAR '{"ETL_table_1":true,"ETL_table_2":true}'),
            (from_iso8601_timestamp('2023-02-02T13:30:11.882Z'), VARCHAR '{"ETL_table_3":true}'),
            (from_iso8601_timestamp('2023-02-02T20:12:22.116Z'), VARCHAR '{"ETL_table_4":true,"ETL_table_5":true,"ETL_table_6":false}')
    
    )
    select
       ts,
        t.*
    FROM data
       CROSS JOIN UNNEST(CAST(json_parse(value) AS MAP(VARCHAR, JSON))) AS t(upstream_tablename, was_completed);
    
    ts upstream_tablename was_completed
    2023-02-02 17:34:55.946 UTC ETL_table_1 true
    2023-02-02 17:34:55.946 UTC ETL_table_2 true
    2023-02-02 13:30:11.882 UTC ETL_table_3 true
    2023-02-02 20:12:22.116 UTC ETL_table_4 true
    2023-02-02 20:12:22.116 UTC ETL_table_5 true
    2023-02-02 20:12:22.116 UTC ETL_table_6 false