Search code examples
sqlapache-sparkapache-spark-sqlaggregatemultiple-columns

How to aggregate on multiple columns using SQL or spark SQL


I have following table:

Id col1 col2
1  a    1   
1  b    2   
1  c    3   
2  a    1   
2  e    3   
2  f    4  

Expected output is:

Id col3
1  a1b2c3
2  a1e3f4

The aggregation computation involves 2 columns, is this supported in SQL?


Solution

  • In Spark SQL you can do it like this:

    SELECT Id, aggregate(list, '', (acc, x) -> concat(acc, x)) col3
    FROM (SELECT Id, array_sort(collect_list(concat(col1, col2))) list
          FROM df
          GROUP BY Id )
    

    or in one select:

    SELECT Id, aggregate(array_sort(collect_list(concat(col1, col2))), '', (acc, x) -> concat(acc, x)) col3
    FROM df
    GROUP BY Id
    

    Higher-order aggregate function is used in this example.

    aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.