I have a nested array in athena which looks like this:
[
{org=[..],auth={..}},{org=[..],auth={..}},{org=[..],auth={..}}
]
both the array and its contents are optional. I would like to unnest the org
and the auth
into seperate columns, but Iam unsure how to do this.
I have tried (apologies if the attempt is stupid):
SELECT exploded.org FROM "db"."table" a
cross join UNNEST(a.column_name) as exploded limit 10;
but this crashes.
EDIT:
I made progress by reading the docs and now I have this working nicely:
SELECT
id,
j_titles.exploded.org
FROM "db"."table" a
cross join UNNEST(a.authorships) as j_titles(exploded) limit 100;
but this produced a table with id
and org
except that the explosion meant that every exploded item is written as a seperate row. So I have:
id | org
1, [{}]
1, [{}]
1, [{}]
2, [{}]
2, [{}]
and for auth
I have:
id | org
1, {}
1, {}
1, {}
2, {}
2, {}
However, what I was expecting was:
id | org/auth
1, [{},{},{}]
2, [{},{},{}]
3, [{},{},{}]
4, [{},{},{}]
5, [{},{},{}]
Asuming your data is ARRAY(ROW(...))
then you need to specify all the columns the row contains in the alias since Trino unpacks the ROW
type (though in Presto as far as I remember rows are not unpacked so only single alias is needed as exploded(row_col)
):
SELECT *
FROM UNNEST(
ARRAY[
ROW('Java', 1995),
ROW('SQL' , 1974)]
) as t(language, first_appeared_year);
Output:
language | first_appeared_year |
---|---|
Java | 1995 |
SQL | 1974 |
So in your case something like:
as exploded (org, auth)
Note that you can chain unnests and unnest several arrays at a time if needed. Something along these lines (not tested obviously, also note the short syntax for unnest
skipping the cross join
):
SELECT exploded.org
FROM "db"."table" a
, UNNEST(a.column_name) as exploded(orgs, auths)
, UNNEST(orgs, auths) as exploded2(org_col, auth_col) ;
UPD
Sample with your data:
-- sample data, ignore the json handling it is done to simplify data creation
with dataset (json_col) as (values (
json '[
{
"institutions": [
{
"country_code":"FR",
"ror":"https://ror.org/04xmteb38",
"id":"https://openalex.org/I4210158291",
"type":"facility",
"display_name":"Laboratoire Jacques-Louis Lions"
},
{
"country_code":"FR2",
"ror":"test2",
"id":"test2",
"type":"facility2",
"display_name":"Laboratoire SO Answer"
}
],
"author":{
"orcid":"https://orcid.org/0000-0001-5975-9271",
"id":"https://openalex.org/A5055801083",
"display_name":"Laurent Boudin"
},
"raw_affiliation_string":"Laboratoire Jacques-Louis Lions",
"author_position":"first",
"raw_affiliation_strings":[
"Laboratoire Jacques-Louis Lions"
],
"is_corresponding":"false",
"countries":[
"FR"
]
}
]')),
-- substitute for your table
some_table as (
select cast(json_col as
array(
row(
institutions array(row(country_code varchar, ror varchar, id varchar, type varchar, display_name varchar)),
author row(orcid varchar, id varchar, display_name varchar),
raw_affiliation_string varchar,
author_position varchar,
raw_affiliation_strings array(varchar),
is_corresponding boolean,
countries array(varchar))))
as column_name
from dataset
)
-- query
select t.author, institution.*
from some_table
, unnest(column_name) as t(institutions, author, raw_affiliation_string, author_position, raw_affiliation_strings, is_corresponding, countries)
, unnest(institutions) as institution(country_code, ror, id, "type", display_name)
;
Notes:
depending on the Athena version (Presto and Trino as far as I remember handle unnesting arrays of rows differently, the former is not unpacking rows into columns on unnest) you might unnest into a single column which will contain a row type:
select row1.author author, row2 institution
from some_table
, unnest(column_name) as t(row1)
, unnest(row1.institutions) as t1(row2)
If some data actually has empty arrays but you want to keep the related data you will need to switch to handle that for example by switching from cross join
to left join ... on true
see this answer. For example:
from some_table
left join unnest(column_name) as ... on true
left join unnest(...) as ... on true