I'm struggling with a query to solve this: I have this data in athena AWS
name | hobby | food | other_col
cris | sports | pasta | sdsd
cris | music | pizza | qfrfe
cris | sports | pizza | dcfrfe
cris | sports | pizza | koioio
arnold | sports | pasta | joiuhiu
arnold | art | salad | ojouju
arnold | art | pasta | jiojo
jenny | dance | sushi | sdkwdk
jenny | dance | sushi | lkjlj
jenny | ski | pizza | sdkwdk
jenny | dance | pasta | jlkjlkj
And i need to get the most frequents values in columns group by name, something like this:
name | hobby | food |
cris | sports | pizza |
arnold | art | pasta |
jenny | dance | sushi |
Anyone could help me please?
You can use GROUPING SETS
to determine the max counts in every group (name-hobby and name-food) and then use MAX_BY
to get the maximum:
-- sample data
with dataset(name, hobby, food) as (
values ('cris' , 'sports', 'pasta'),
('cris' , 'music' , 'pizza'),
('cris' , 'sports', 'pizza'),
('cris' , 'sports', 'pizza'),
('arnold', 'sports', 'pasta'),
('arnold', 'art' , 'salad'),
('arnold', 'art' , 'pasta'),
('jenny' ,'dance' ,'sushi'),
('jenny' ,'dance' ,'sushi'),
('jenny' , 'ski' ,'pizza'),
('jenny' ,'dance' ,'pasta')
)
-- query
select name,
max_by(hobby, if(hobby is not null, cnt)) hobby,
max_by(food, if(food is not null, cnt)) food
from (select name,
hobby,
food,
count(*) cnt
from dataset
group by grouping sets ((name, hobby), (name, food)))
group by name;
Output:
name | hobby | food |
---|---|---|
jenny | dance | sushi |
cris | sports | pizza |
arnold | art | pasta |
Another approach - use histogram
and some map functions magic to determine maximum element in the map:
-- query
select name,
map_keys(
map_filter(
h,
(k, v) -> v = array_max(map_values(h))))[1] hobby,
map_keys(
map_filter(
f,
(k, v) -> v = array_max(map_values(f))))[1] food
from (select name,
histogram(hobby) h,
histogram(food) f
from dataset
group by name);