I have a json array structure similar to the following:
[elem1
elem2
attr
-- code1
-- code2
]
I was able to upload these jsons and preserved their format correctly, and used AWS Glue to crawl the json and I got all of the proper data structure elements (attr has a lot of sub elements that were correctly extracted).
When I go to Athena; however, and I do a select *
, I get the 3 base columns (elem1, elem2, attr), but the rows have each of the jsons as a whole:
elem1 |elem2
---------------------------------------------------------------------------------------------------
{elem1:"a",elem2:"b",attr:{code1:"1",code2:"2"}} |{elem1:"a",elem2:"b",attr:{code1:"1",code2:"2"}}
(with attr as a column name at the end)
Interestingly enough, when I relationalized the data and made it into a parquet file, all of the fields were preserved, but I am worried of some kind of data loss, since there are multiple jsons stored in one row, so there may be some kind of overwriting issue.
Any ideas on how to proceed? I've been trying to find ways to query the json on Athena, but it may just be a problem with the json itself (I downloaded the json and looked at the structure and it looks fine visually).
The reason why the JSON format that you are using was not working is because of this. The behaviour is expected and for your JSON file to work properly each record has to be present on separate line. Also refer to this talks about the requirement To parse JSON-encoded data in Athena, make sure that each JSON document is on its own line, separated by a new line.