Tags: apache-spark, pyspark, apache-spark-sql

Pivot with custom column names in pyspark


I need to pivot a table with custom column names. Please see the table format below.

+-------+------+------+--------+
|balance| ex_cy|rp_prd|scenario|
+-------+------+------+--------+
|    100|2022Q2|202204|     abc|
|    200|2022Q2|202204|     abc|
|    300|2022Q2|202204|     abc|
+-------+------+------+--------+

I want to convert it into the format below:

+------+------+--------+---+---+---+
| ex_cy|rp_prd|scenario| C1| C2| C3|
+------+------+--------+---+---+---+
|2022Q2|202204|     abc|100|200|300|
+------+------+--------+---+---+---+

Can anyone help with this, please?


Solution

  • Let's create an additional id column to uniquely identify rows per 'ex_cy', 'rp_prd', and 'scenario', then do a groupBy + pivot and aggregate balance with first:

    from pyspark.sql import Window, functions as F
    
    cols = ['ex_cy', 'rp_prd', 'scenario']
    
    # Number the rows within each group to build unique column labels C1, C2, ...
    W = Window.partitionBy(*cols).orderBy('balance')
    id_ = F.concat(F.lit('C'), F.row_number().over(W))
    
    # Pivot on the generated labels; each (group, id) cell holds exactly one row
    result = (
        df
        .withColumn('id', id_)
        .groupBy(*cols).pivot('id').agg(F.first('balance'))
    )
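
    Here, F.row_number orders balances ascending within each group, so C1 holds the smallest balance and C3 the largest; because each pivoted cell contains exactly one row, F.first simply returns that single balance.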
    

    Result

    +------+------+--------+---+---+---+
    | ex_cy|rp_prd|scenario| C1| C2| C3|
    +------+------+--------+---+---+---+
    |2022Q2|202204|     abc|100|200|300|
    +------+------+--------+---+---+---+
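
  • To try this end to end, here is a minimal, self-contained sketch that rebuilds the sample DataFrame from the question (the local SparkSession setup is my assumption, not part of the original post):

    from pyspark.sql import SparkSession
    
    # Local session for testing (assumption; use your existing session in practice)
    spark = SparkSession.builder.getOrCreate()
    
    # Sample data copied from the question's input table
    df = spark.createDataFrame(
        [(100, '2022Q2', '202204', 'abc'),
         (200, '2022Q2', '202204', 'abc'),
         (300, '2022Q2', '202204', 'abc')],
        ['balance', 'ex_cy', 'rp_prd', 'scenario'],
    )

    Running the snippet above on this df produces the pivoted result shown.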