Search code examples
sqlamazon-web-servicesamazon-athenaprestotrino

Athena unnest for nested Array string column


I have a column in athena whose datatype is string and data looks like:

[{'main_key1': {'key1': value, 'key2': value2}, 'outer_key': outer_value},{'main_key2': {'key1': value, 'key2': value2}, 'outer_key': outer_value1}]

I have to split the value of this list (outer_key1) across multiple rows.

for eg:

id outer_key
1 outer_value
1 outer_value1

Can anyone please help if there is a way to acheive this

I already tried json_parse, unnest but everything is ending in some kind of exception


Solution

  • It depends on what your data actually is. If actually is a valid JSON string then you can process it as JSON:

    -- sample data
    with dataset(some_json) as(
        values ('[{"main_key1": {"key1": 1, "key2": "value2"}, "outer_key": "outer_value"},{"main_key2": {"key1": 1, "key2": 2}, "outer_key": "outer_value1"}]')
    )
    
    -- query
    select json_extract(js, '$.outer_key')
    from dataset,
         unnest(cast(json_parse(some_json) as array(json))) as t(js);
    

    Output:

         _col0
    ----------------
     "outer_value"
     "outer_value1"
    

    If it is just a string then you will need to use regular expressions. Something to get you started (the regular expression can change based on actual data):

    with dataset(some_str) as(
        values ('[{''main_key1'': {''key1'': value, ''key2'': value2}, ''outer_key'': outer_value},{''main_key2'': {''key1'': value, ''key2'': value2}, ''outer_key'': outer_value1}]')
    )
    
    select val
    from dataset,
         unnest(regexp_extract_all(some_str, '''outer_key''\s?:\s?\b(\w*)\b', 1)) as t(val);
    

    Output:

         val
    --------------
     outer_value
     outer_value1