Search code examples
apache-sparkpysparkpivotmultiple-columnspyspark-pandas

Create multiple columns by pivoting even when pivoted value doesn't exist


I have a PySpark df:

Store_ID Category ID Sales
1 A 123 23
2 A 123 45
1 A 234 67
1 B 567 78
2 B 567 34
3 D 789 12
1 A 890 12

Expected:

Store_ID A_ID B_ID C_ID D_ID Sales_A Sales_B Sales_C Sales_D
1 3 1 0 0 102 78 0 0
2 1 1 0 0 45 34 0 0
3 0 0 0 1 0 0 0 12

I am able to transform this way using SQL (created a temp view):

SELECT Store_Id,
       SUM(IF(Category='A',Sales,0)) AS Sales_A,
       SUM(IF(Category='B',Sales,0)) AS Sales_B,
       SUM(IF(Category='C',Sales,0)) AS Sales_C,
       SUM(IF(Category='D',Sales,0)) AS Sales_D,
       COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
       COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
       COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
       COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;

How do we achieve the same in PySpark using native functions as its much faster?


Solution

  • This operation is called pivoting.

    • a couple of aggregations, since you need both, count of ID and sum of Sales
    • alias for aggregations, for changing column names
    • providing values in pivot, for cases where you want numbers for Category C, but C doesn't exist. Providing values boosts performance too.

    Input:

    from pyspark.sql import functions as F
    df = spark.createDataFrame(
        [(1, 'A', 123, 23),
         (2, 'A', 123, 45),
         (1, 'A', 234, 67),
         (1, 'B', 567, 78),
         (2, 'B', 567, 34),
         (3, 'D', 789, 12),
         (1, 'A', 890, 12)],
        ['Store_ID', 'Category', 'ID', 'Sales'])
    

    Script:

    df = (df
        .groupBy('Store_ID')
        .pivot('Category', ['A', 'B', 'C', 'D'])
        .agg(
            F.countDistinct('ID').alias('ID'),
            F.sum('Sales').alias('Sales'))
        .fillna(0))
    df.show()
    # +--------+----+-------+----+-------+----+-------+----+-------+
    # |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
    # +--------+----+-------+----+-------+----+-------+----+-------+
    # |       1|   3|    102|   1|     78|   0|      0|   0|      0|
    # |       3|   0|      0|   0|      0|   0|      0|   1|     12|
    # |       2|   1|     45|   1|     34|   0|      0|   0|      0|
    # +--------+----+-------+----+-------+----+-------+----+-------+