AWS Athena parse JSON array

I had previously had asked a question, and it was answered (AWS Athena Parse array of JSON objects to rows), about parsing JSON arrays using Athena but running into a variation.

Using the example:

SELECT user_textarray
FROM "sample"."workdetail" 
where workid = '5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195'

The results returned as:

[{"userlist":"{'id': 'd87b002d-6c75-4c5a-b546-fe04cc939da9', 'name': 'John Smith'}"}, 
 {"userlist":"{'id': '41f20d65-c333-4fe5-bbe5-f9c63566cfc3', 'name': 'Larry Johnson'}"}, 
 {"userlist":"{'id': '18106aa2-e461-4ac5-b399-b2e209c0c341', 'name': 'Kim Jackson'}"}
]

What I'm trying to return is the list of id and name as rows related to the workid in the original query. I'm not sure why the JSON is formated this way and it comes from a 3rd party so can't make adjustments so needing to figure out how to parse the object within an object.

workid, id, name
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195,d87b002d-6c75-4c5a-b546-fe04cc939da9,'John Smith'
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195,41f20d65-c333-4fe5-bbe5-f9c63566cfc3,'Larry Johnson'
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195,18106aa2-e461-4ac5-b399-b2e209c0c341,'Kim Jackson'

I have tried variations of this but not working so trying to determine if I need to modify my 'with' statement to get to the object within the object or if on the select I need to further parse the object to get the elements I need.

with dataset as (workid, user_textarray
FROM "sample"."workdetail"
cross join unnest(user_textarray)
where workid = '5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195')
select workid,
       json_extract_scalar(json, '$.userlist.name') name
from dataset
, unnest(user_textarray) as t(json);

Solution

The problem is in your data, from the Presto/Trino point of view userlist contains a string, not a JSON object, moreover this string itself is not a valid JSON for it since it contains ' instead of '"' for props.

To "fix" this you can take the following steps (the only workaround I know):

Extract the userlist
Replace ' with " (some other JSON parsers will actually handle this "correctly" and will not require this step, but not on case of Trino/Presto)
Process new JSON as you need.

Something to get you started:

-- sample data
with dataset(workid, user_textarray) as (
values ('5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195', array['{"userlist":"{''id'': ''d87b002d-6c75-4c5a-b546-fe04cc939da9'', ''name'': ''John Smith''}"}',
 '{"userlist":"{''id'': ''41f20d65-c333-4fe5-bbe5-f9c63566cfc3'', ''name'': ''Larry Johnson''}"}',
 '{"userlist":"{''id'': ''18106aa2-e461-4ac5-b399-b2e209c0c341'', ''name'': ''Kim Jackson''}"}'
])
)

-- query
select workid,
       json_extract_scalar(replace(json_extract_scalar(json, '$.userlist'),'''', '"'), '$.name') name
from dataset,
     unnest(user_textarray) as t(json);

Output:

workid	name
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195	John Smith
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195	Larry Johnson
5bb0a33f-3ca6-4f9c-9676-0b4d62dbb195	Kim Jackson

Note that for your goal you can use CTE/subquery so you don't need to write the handling multiple times.