Search code examples
sqlaws-glueamazon-athenaprestotrino

Athena query JSON Array without struct


In Athena how can I structure a select statement to query the below by timestamp? The data is stored as a string

[{
    "data": [{
        "ct": "26.7"
    }, {
        "ct": "24.9",
    }, {
        "ct": "26.8",
    }],
    "timestamp": "1658102460"
}, {
    "data": [{
        "ct": "26.7",
        }, {
        "ct": "25.0",
        }],
    "timestamp": "1658102520"
}]

I tried the below but it just came back empty.

SELECT json_extract_scalar(insights, '$.timestamp') as ts
FROM history

What I am trying to get to is returning only the data where a timestamp is between X & Y.

When I try doing this as a struct and a cross join with unnest it's very very slow so I am trying to find another way.


Solution

  • json_extract_scalar will not help here because it returns only one value. Trino improved vastly json path support but Athena has much more older version of the Presto engine which does not support it. So you need to cast to array and use unnest (removed trailing commas from json):

    -- sample data
    WITH dataset (json_str) AS (
        values ('[{
        "data": [{
            "ct": "26.7"
        }, {
            "ct": "24.9"
        }, {
            "ct": "26.8"
        }],
        "timestamp": "1658102460"
    }, {
        "data": [{
            "ct": "26.7"
            }, {
            "ct": "25.0"
            }],
        "timestamp": "1658102520"
    }]')
    )
    
    -- query
    select mp['timestamp'] timestamp,
        mp['data'] data
    from dataset,
    unnest(cast(json_parse(json_str) as array(map(varchar, json)))) as t(mp)
    

    Output:

    timestamp data
    1658102460 [{"ct":"26.7"},{"ct":"24.9"},{"ct":"26.8"}]
    1658102520 [{"ct":"26.7"},{"ct":"25.0"}]

    After that you can apply filtering and process data.