Search code examples
sqlamazon-athenaprestotrino

SQL most frequent values on multiple columns group by other column


I'm struggling with a query to solve this: I have this data in athena AWS

 name   | hobby    | food   | other_col
 cris   | sports   | pasta  |  sdsd
 cris   | music    | pizza  |  qfrfe
 cris   | sports   | pizza  |  dcfrfe
 cris   | sports   | pizza  |  koioio
 arnold | sports   | pasta  |  joiuhiu
 arnold | art      | salad  |  ojouju
 arnold | art      | pasta  |  jiojo
 jenny  | dance    | sushi  |  sdkwdk
 jenny  | dance    | sushi  |  lkjlj
 jenny  |  ski     | pizza  |  sdkwdk
 jenny  | dance    | pasta  |  jlkjlkj

And i need to get the most frequents values in columns group by name, something like this:

name   | hobby   | food  |
cris   | sports  | pizza |
arnold | art     | pasta |
jenny  | dance   | sushi |

Anyone could help me please?


Solution

  • You can use GROUPING SETS to determine the max counts in every group (name-hobby and name-food) and then use MAX_BY to get the maximum:

    -- sample data
    with dataset(name, hobby, food) as (
        values ('cris'  , 'sports', 'pasta'),
            ('cris'  , 'music' , 'pizza'),
            ('cris'  , 'sports', 'pizza'),
            ('cris'  , 'sports', 'pizza'),
            ('arnold', 'sports', 'pasta'),
            ('arnold', 'art'   , 'salad'),
            ('arnold', 'art'   , 'pasta'),
            ('jenny' ,'dance'  ,'sushi'),
            ('jenny' ,'dance'  ,'sushi'),
            ('jenny' , 'ski'   ,'pizza'),
            ('jenny' ,'dance'  ,'pasta')
    )
    
    -- query
    select name,
           max_by(hobby, if(hobby is not null, cnt)) hobby,
           max_by(food, if(food is not null, cnt)) food
    from (select name,
                 hobby,
                 food,
                 count(*) cnt
          from dataset
          group by grouping sets ((name, hobby), (name, food)))
    group by name;
    

    Output:

    name hobby food
    jenny dance sushi
    cris sports pizza
    arnold art pasta

    Another approach - use histogram and some map functions magic to determine maximum element in the map:

    -- query
    select name,
           map_keys(
                   map_filter(
                           h,
                           (k, v) -> v = array_max(map_values(h))))[1] hobby,
           map_keys(
                   map_filter(
                           f,
                           (k, v) -> v = array_max(map_values(f))))[1] food
    from (select name,
                 histogram(hobby) h,
                 histogram(food)  f
          from dataset
          group by name);