Search code examples
sqlamazon-athenaprestotrino

unnesting an array into two other array sql (athena/presto)


I have a nested array in athena which looks like this:

[
{org=[..],auth={..}},{org=[..],auth={..}},{org=[..],auth={..}}
]

both the array and its contents are optional. I would like to unnest the org and the auth into seperate columns, but Iam unsure how to do this.

I have tried (apologies if the attempt is stupid):

SELECT exploded.org FROM "db"."table" a
cross join UNNEST(a.column_name) as exploded limit 10;

but this crashes.

EDIT:

I made progress by reading the docs and now I have this working nicely:

SELECT
  id,
  j_titles.exploded.org
FROM "db"."table" a
cross join UNNEST(a.authorships) as j_titles(exploded) limit 100;

but this produced a table with id and org except that the explosion meant that every exploded item is written as a seperate row. So I have:

id | org
1, [{}]
1, [{}]
1, [{}]
2, [{}]
2, [{}]

and for auth I have:

id | org
1, {}
1, {}
1, {}
2, {}
2, {}

However, what I was expecting was:

id | org/auth
1, [{},{},{}]
2, [{},{},{}]
3, [{},{},{}]
4, [{},{},{}]
5, [{},{},{}]

Solution

  • Asuming your data is ARRAY(ROW(...)) then you need to specify all the columns the row contains in the alias since Trino unpacks the ROW type (though in Presto as far as I remember rows are not unpacked so only single alias is needed as exploded(row_col)):

    SELECT *
    FROM UNNEST(
            ARRAY[
                ROW('Java',  1995),
                ROW('SQL' , 1974)]
    ) as t(language, first_appeared_year);
    

    Output:

    language first_appeared_year
    Java 1995
    SQL 1974

    So in your case something like:

    as exploded (org, auth)
    

    Note that you can chain unnests and unnest several arrays at a time if needed. Something along these lines (not tested obviously, also note the short syntax for unnest skipping the cross join):

    SELECT exploded.org 
    FROM "db"."table" a
    , UNNEST(a.column_name) as exploded(orgs, auths)
    , UNNEST(orgs, auths) as exploded2(org_col, auth_col) ;
    

    UPD

    Sample with your data:

    -- sample data, ignore the json handling it is done to simplify data creation
    with dataset (json_col) as (values (
        json '[
       {
          "institutions": [
             {
                "country_code":"FR",
                "ror":"https://ror.org/04xmteb38",
                "id":"https://openalex.org/I4210158291",
                "type":"facility",
                "display_name":"Laboratoire Jacques-Louis Lions"
             },
             {
                "country_code":"FR2",
                "ror":"test2",
                "id":"test2",
                "type":"facility2",
                "display_name":"Laboratoire SO Answer"
             }
          ],
          "author":{
             "orcid":"https://orcid.org/0000-0001-5975-9271",
             "id":"https://openalex.org/A5055801083",
             "display_name":"Laurent Boudin"
          },
          "raw_affiliation_string":"Laboratoire Jacques-Louis Lions",
          "author_position":"first",
          "raw_affiliation_strings":[
             "Laboratoire Jacques-Louis Lions"
          ],
          "is_corresponding":"false",
          "countries":[
             "FR"
          ]
       }
    ]')),
    -- substitute for your table
    some_table as (
        select cast(json_col as
            array(
                row(
                    institutions array(row(country_code varchar, ror varchar, id varchar, type varchar, display_name varchar)),
                    author row(orcid varchar, id varchar, display_name varchar),
                    raw_affiliation_string varchar,
                    author_position varchar,
                    raw_affiliation_strings array(varchar),
                    is_corresponding boolean,
                    countries array(varchar))))
            as column_name
        from dataset
    )
    
    -- query
    select t.author, institution.*
    from some_table
    , unnest(column_name) as t(institutions, author, raw_affiliation_string, author_position, raw_affiliation_strings, is_corresponding, countries)
    , unnest(institutions) as institution(country_code, ror, id, "type", display_name)
    ;
    

    Notes:

    • depending on the Athena version (Presto and Trino as far as I remember handle unnesting arrays of rows differently, the former is not unpacking rows into columns on unnest) you might unnest into a single column which will contain a row type:

      select row1.author author, row2 institution
      from some_table
      , unnest(column_name) as t(row1)
      , unnest(row1.institutions) as t1(row2)
      
    • If some data actually has empty arrays but you want to keep the related data you will need to switch to handle that for example by switching from cross join to left join ... on true see this answer. For example:

      from some_table
      left join unnest(column_name) as ... on true
      left join unnest(...) as ... on true