Search code examples
hivehiveql

Sorting within collect_list() in hive


Let's say I have a hive table that looks like this:

ID    event    order_num
------------------------
A      red         2
A      blue        1
A      yellow      3
B      yellow      2
B      green       1
...

I'm trying to use collect_list to generate a list of events for each ID. So something like the following:

SELECT ID, 
collect_list(event) as events_list,
FROM table
GROUP BY ID;

However, within each of the IDs that I group by, I need to sort by order_num. So that my resulting table would look like this:

ID    events_list
------------------------
A      ["blue","red","yellow"]
B      ["green","red"]

I can't do a global sort by ID and order_num before the collect_list() query because the table is massive. Is there a way to sort by order_num within collect_list?

Thanks!


Solution

  • So, I found the answer here. The trick is to use a subquery with a DISTRIBUTE BY and SORT BY statement. See below:

    WITH table1 AS (
        SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
        SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
        SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
        SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
        SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
    )
    
    -- Collect it
    SELECT subquery.ID, 
    collect_list(subquery.event) as events_list
    FROM (
    SELECT
            table1.ID,
            table1.event,
            table1.order_num
        FROM table1
        DISTRIBUTE BY
            table1.ID
        SORT BY
            table1.ID,
            table1.order_num
    ) subquery
    GROUP BY subquery.ID;