json amazon-web-services amazon-athena presto

Presto/Athena - query to discover JSON attribute frequencies?

I have defined a Hive table where a single column contains JSON text:

CREATE EXTERNAL TABLE IF NOT EXISTS my.rawdata (
  json string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = '\n',
   'quoteChar' = '\0',
   'escapeChar' = '\r'
)
STORED AS TEXTFILE
LOCATION 's3://mydata/';

Is there a Presto/Athena query that can list out all field names that occur within the JSON and their frequency (i.e. total number of times the attribute appears in the table)?

Solution

Use the JSON functions to parse the JSON and turn it into a map. Then extract the keys and unnest them. Finally, use a normal SQL aggregation:

SELECT key, count(*)
FROM (
  SELECT map_keys(cast(json_parse(json) AS map(varchar, json))) AS keys
  FROM rawdata
)
CROSS JOIN UNNEST (keys) AS t (key)
GROUP BY key