I have data in XML form which I have converted in JSON format through glue crawler. The problem is in writing the DDL statement for a table in Athena as you can see below there is a contact attribute in JSON data. Somewhere it is a structure (single instance) and somewhere it is in array form (multiple instances). I am sharing the DDL statements below as well for each type.
JSON Data Type 1
"ContactList": {
"Contact": {
}
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList: struct<
Contact: struct<
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
JSON Data Type 2
"ContactList": {
"Contact": [
{},
{}
]
}
Athena DDL Statement
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
ContactList: struct<
Contact: array <
struct<
>
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3_bucket_path'
TBLPROPERTIES ('has_encrypted_data'='false')
I am able to write DDL statement for one case at a time only and it work perfectly for individual type. My question is how we can write DDL statements so it can cater to both types either it is struct or array. Thanks in advance.
The way you solve this in Athena is that you use the string
type for the Contact
field of the ContactList
column, and then JSON functions in your queries.
When you query you can for example do (assuming contacts have a "name" field):
SELECT
COALESCE(
json_extract_scalar(ContactList.Contact, '$.name[0]'),
json_extract_scalar(ContactList.Contact, '$.name')
) AS name
FROM table_name
This uses json_extract_scalar
which parses a string as JSON and then extracts a value using a JSONPath expression. COALESCE
picks the first non-null value, so if the first JSONPath expression does not yield any value (because the property is not an array), the second is attempted.