The dataset looks like this:
id | result | rank |
---|---|---|
001 | pass | 2 |
002 | fail | 3 |
001 | fail | 1 |
002 | pass | 1 |
What I want to do: group the dataset by id and concatenate the results in ascending order of rank column.
id | results |
---|---|
001 | fail-pass |
002 | pass-fail |
As the other column's order is involved, the concat_ws('-',collect_set(result))
function cannot fulfill my thought.
Are there any built-in functions to help me achieve this, or writing a UDF seems the only solution?
In a subquery before collect_set, distribute by id and sort by id, rank. Dataset will be distributed between reducers by id and sorted by rank before aggregation. See comments in the code.
Demo:
with demo_dataset as ( --Use your table instead of this CTE
select stack(4,
'001' , 'pass', 2,
'002' , 'fail', 3,
'001' , 'fail', 1,
'002' , 'pass', 1
) as (id,result,rank)
)
select id, concat_ws('-',collect_set(result))
from
(
select t.*
from demo_dataset t
distribute by id --Distribute by grouping column
sort by id, rank --Sort in required order
) s
group by id
Result:
id results
001 fail-pass
002 pass-fail
Now if you change SORT: sort by id, rank desc
you will get results ordered differently