sql apache-spark apache-spark-sql aggregate multiple-columns

How to aggregate on multiple columns using SQL or Spark SQL

I have the following table:

Id col1 col2
1  a    1   
1  b    2   
1  c    3   
2  a    1   
2  e    3   
2  f    4

Expected output is:

Id col3
1  a1b2c3
2  a1e3f4

The aggregation involves two columns: for each Id, the col1 and col2 values of every row are concatenated into one string. Is this supported in SQL?

Solution

In Spark SQL you can do it like this:

SELECT Id, aggregate(list, '', (acc, x) -> concat(acc, x)) col3
FROM (SELECT Id, array_sort(collect_list(concat(col1, col2))) list
      FROM df
      GROUP BY Id )

Or in a single SELECT:

SELECT Id, aggregate(array_sort(collect_list(concat(col1, col2))), '', (acc, x) -> concat(acc, x)) col3
FROM df
GROUP BY Id

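This example uses the higher-order function aggregate. For each Id, collect_list gathers the concatenated col1 and col2 values into an array, array_sort puts them in a deterministic order, and aggregate folds the array into a single string.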

aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function. The finish argument is optional; the queries above omit it, so the final state is returned as-is.
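To make the fold concrete, here is a minimal plain-Python sketch of what the query computes. It is not Spark code: the aggregate function below is a hypothetical stand-in written with functools.reduce, and rows is simply the sample table from the question typed in by hand.

```python
from collections import defaultdict
from functools import reduce

# Sample rows from the question: (Id, col1, col2)
rows = [
    (1, "a", 1), (1, "b", 2), (1, "c", 3),
    (2, "a", 1), (2, "e", 3), (2, "f", 4),
]

def aggregate(array, start, merge, finish=lambda acc: acc):
    """Hypothetical stand-in for the SQL function: fold, then apply finish."""
    return finish(reduce(merge, array, start))

def concat_per_id(rows):
    # GROUP BY Id + collect_list(concat(col1, col2))
    groups = defaultdict(list)
    for id_, col1, col2 in rows:
        groups[id_].append(f"{col1}{col2}")
    # array_sort, then aggregate(list, '', (acc, x) -> concat(acc, x))
    return {
        id_: aggregate(sorted(items), "", lambda acc, x: acc + x)
        for id_, items in groups.items()
    }

result = concat_per_id(rows)
print(result)  # {1: 'a1b2c3', 2: 'a1e3f4'}
```

The sort is applied to the concatenated strings, so the pieces appear in lexicographic order of col1 followed by col2, which matches the expected output for this data.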
