Search code examples
sqlarrayshiveconcatenationhiveql

How to concat one column by order after group by?


The dataset looks like this:

id result rank
001 pass 2
002 fail 3
001 fail 1
002 pass 1

What I want to do: group the dataset by id and concatenate the results in ascending order of rank column.

id results
001 fail-pass
002 pass-fail

As the other column's order is involved, the concat_ws('-',collect_set(result))function cannot fulfill my thought.

Are there any built-in functions to help me achieve this, or writing a UDF seems the only solution?


Solution

  • In a subquery before collect_set, distribute by id and sort by id, rank. Dataset will be distributed between reducers by id and sorted by rank before aggregation. See comments in the code.

    Demo:

    with demo_dataset as ( --Use your table instead of this CTE
    select stack(4,
    '001' , 'pass', 2,
    '002' , 'fail', 3,
    '001' , 'fail', 1,
    '002' , 'pass', 1
    ) as (id,result,rank)
    )
    
    select id, concat_ws('-',collect_set(result))
    from
    (
    select t.* 
      from demo_dataset t
    distribute by id   --Distribute by grouping column
    sort by id, rank   --Sort in required order
    ) s
    group by id
    

    Result:

    id  results
    001 fail-pass
    002 pass-fail
    

    Now if you change SORT: sort by id, rank desc you will get results ordered differently