Search code examples
sqlddlamazon-athena

Athena nested Struct Querying - how to query Value_counts in SQL


I have a big nested struct in AWS Athena. Here's one column named "petowners" in the table:

{_id=5e6b531a412345e0e86aeae0, status=NotAnalyzed, animalcategories=[{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}], ...many-other-values}

I'm looking for:

  1. The equivalent of the python function value_counts in the column. Meaning I'm looking for the SQL Athena command that will output for this row: [mammals:1, birds:2, UnknownField:4]
  2. A way to query aggregation - to create a histogram of total number of pets per owner for this row = 7
  3. How many pet owners have an UnknownField in 'animalycategories'
  4. How many animal types are there in the entire table?

Solution

  • Here's the beginning of the solution: Let's call the table "entire_table"

     SELECT t.entire_table._id,
             t.petowners.animalcategories,
             ac.categoryname,
             ac.matches
    FROM entire_table t, UNNEST(t.petowners.animalcategories) AS t(ac)
    

    This query will output a table with columns named "categoryname" and "matches", where each row is duplicated for as many category names as there are for each user_id:

    | _id                      | animalcategories                                                                                              | categoryname | matches |
    |--------------------------|---------------------------------------------------------------------------------------------------------------|--------------|---------|
    | 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | mammals      | 1       |
    | 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | birds        | 2       |
    | 5e6b531a412345e0e86aeae0 | [{categoryname=mammals, matches=1}, {categoryname=birds, matches=2}, {categoryname= UnknownField, matches=4}] | UnknownField | 4       |
    

    And here are the most relevant links by order of importance that enabled the solution:

    Going down the rabbit hole I encountered some less helpful links that I find worth mentioning, for the sake of this thorough review I'll add them here:

    I hope someone someday will find this post useful and get themselves a shortcut from a few hours of browsing the web for answers I had to go through. Good luck.