I have defined a Hive table where a single column contains JSON text:
CREATE EXTERNAL TABLE IF NOT EXISTS my.rawdata (
json string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = '\n',
'quoteChar' = '\0',
'escapeChar' = '\r'
)
STORED AS TEXTFILE
LOCATION 's3://mydata/';
Is there a Presto/Athena query that can list out all field names that occur within the JSON and their frequency (i.e. total number of times the attribute appears in the table)?
Use the JSON functions to parse the JSON and turn it into a map. Then extract the keys and unnest them. Finally, use a normal SQL aggregation:
SELECT key, count(*)
FROM (
SELECT map_keys(cast(json_parse(json) AS map(varchar, json))) AS keys
FROM rawdata
)
CROSS JOIN UNNEST (keys) AS t (key)
GROUP BY key