Search code examples
pythonsqlpysparkdatabricks

Spark SQL - Pivot and concatenation


I am working with spark sql and have a requirement to pivot and concatenate the data. My input data looks like

ID Quantity Location
1 10 US
2 20 UK
2 5 CA
2 20 US
3 15 US
3 20 CA
4 25 US
4 10 CA

My Output needs to be able to sum the quantity on ID and store the location with comma separated values.

ID Quantity Location
1 10 US
2 45 UK,CA,US
3 35 US,CA
4 35 US,CA

I tried various methods in spark sql and pyspark available here, but was not able to achieve my results


Solution

  • Sql version below if data are in table, using collect_list with concat_ws...

    %sql
    with table1 as (
      select 1 as id, 10 as quantity, 'US' as location union all 
      select 2, 20, 'UK' union all
      select 2, 5, 'CA' union all
      select 2, 20, 'US' union all
      select 3, 15, 'US' union all
      select 3, 20, 'CA' union all
      select 4, 25, 'US' union all
      select 4, 10, 'CA'
    )
    select 
     id, 
     sum(quantity) as quantity, 
     concat_ws(',', collect_list(location)) as location
    from table1
    group by id
    order by id
    
    id quantity location
    1 10 US
    2 45 CA,UK,US
    3 35 US,CA
    4 35 US,CA