Let's say I have a hive table that looks like this:
ID event order_num
------------------------
A red 2
A blue 1
A yellow 3
B yellow 2
B green 1
...
I'm trying to use collect_list to generate a list of events for each ID. So something like the following:
SELECT ID,
collect_list(event) as events_list,
FROM table
GROUP BY ID;
However, within each of the IDs that I group by, I need to sort by order_num. So that my resulting table would look like this:
ID events_list
------------------------
A ["blue","red","yellow"]
B ["green","red"]
I can't do a global sort by ID and order_num before the collect_list() query because the table is massive. Is there a way to sort by order_num within collect_list?
Thanks!
So, I found the answer here. The trick is to use a subquery with a DISTRIBUTE BY and SORT BY statement. See below:
WITH table1 AS (
SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
-- Collect it
SELECT subquery.ID,
collect_list(subquery.event) as events_list
FROM (
SELECT
table1.ID,
table1.event,
table1.order_num
FROM table1
DISTRIBUTE BY
table1.ID
SORT BY
table1.ID,
table1.order_num
) subquery
GROUP BY subquery.ID;