I am working with Spark SQL and have a requirement to pivot and concatenate the data. My input data looks like this:
ID | Quantity | Location |
---|---|---|
1 | 10 | US |
2 | 20 | UK |
2 | 5 | CA |
2 | 20 | US |
3 | 15 | US |
3 | 20 | CA |
4 | 25 | US |
4 | 10 | CA |
My output needs to sum the quantity per ID and store the locations as comma-separated values:
ID | Quantity | Location |
---|---|---|
1 | 10 | US |
2 | 45 | UK,CA,US |
3 | 35 | US,CA |
4 | 35 | US,CA |
I tried various methods available here in Spark SQL and PySpark, but was not able to achieve my desired result.
SQL version below, assuming the data are in a table, using collect_list with concat_ws:
%sql
with table1 as (
  -- sample data built inline; replace with your source table
  select 1 as id, 10 as quantity, 'US' as location union all
  select 2, 20, 'UK' union all
  select 2, 5, 'CA' union all
  select 2, 20, 'US' union all
  select 3, 15, 'US' union all
  select 3, 20, 'CA' union all
  select 4, 25, 'US' union all
  select 4, 10, 'CA'
)
select
  id,
  sum(quantity) as quantity,
  -- collect_list gathers every location per id; concat_ws joins them with commas
  concat_ws(',', collect_list(location)) as location
from table1
group by id
order by id
id | quantity | location |
---|---|---|
1 | 10 | US |
2 | 45 | CA,UK,US |
3 | 35 | US,CA |
4 | 35 | US,CA |
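Since the question also mentions PySpark, here is a minimal sketch of the same aggregation with the DataFrame API, assuming a SparkSession is already available as spark and recreating the sample data inline; the variable names are illustrative.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Recreate the sample input from the question
df = spark.createDataFrame(
    [(1, 10, "US"), (2, 20, "UK"), (2, 5, "CA"), (2, 20, "US"),
     (3, 15, "US"), (3, 20, "CA"), (4, 25, "US"), (4, 10, "CA")],
    ["id", "quantity", "location"],
)

result = (
    df.groupBy("id")
      .agg(
          F.sum("quantity").alias("quantity"),
          # collect_list keeps duplicates; its element order is not guaranteed
          F.concat_ws(",", F.collect_list("location")).alias("location"),
      )
      .orderBy("id")
)
result.show()

Note that collect_list does not guarantee element order after the group-by shuffle, which is why the query returns CA,UK,US for id 2 rather than the UK,CA,US shown in the expected output. Wrapping the list in sort_array (F.sort_array in PySpark) gives a deterministic, alphabetically sorted order, and collect_set can be used instead of collect_list if duplicate locations should be dropped.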