PySpark: not cumulative sum over partition


I would like to sum over a partition, but instead of a cumulative sum I want each row to carry the total sum of its partition:

From:

Category A   Category B   Value
1            2            100
1            2            150
2            1            110
2            2            200

I would like to have:

Category A   Category B   Value   Sum
1            2            100     250
1            2            150     250
2            1            110     110
2            2            200     200

With:

from pyspark.sql.functions import sum
from pyspark.sql.window import Window
windowSpec = Window.partitionBy(["Category A","Category B"])
df = df.withColumn('sum', sum(df.Value).over(windowSpec))

I don't obtain the result I want; instead, I obtain the cumulative sum:

Category A   Category B   Value   Sum
1            2            100     100
1            2            150     250
2            1            110     110
2            2            200     200

How could I proceed? Thank you


Solution

  • When defining a window, you can also specify the frame, i.e. the range of rows it covers.

    Using the frame (Window.unboundedPreceding, Window.unboundedFollowing) makes the sum run over all rows within each partition, irrespective of the ordering of the rows:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    windowSpec = Window.partitionBy(["Category A", "Category B"])\
        .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    df.withColumn('sum', F.sum(df.Value).over(windowSpec))\
        .orderBy("Category A", "Category B").show()
    

    prints

    +----------+----------+-----+-----+
    |Category A|Category B|Value|  sum|
    +----------+----------+-----+-----+
    |         1|         2|  100|250.0|
    |         1|         2|  150|250.0|
    |         2|         1|  110|110.0|
    |         2|         2|  200|200.0|
    +----------+----------+-----+-----+
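
    For reference, here is a minimal end-to-end sketch; the SparkSession setup and the inline example data are assumptions added for illustration and are not part of the original question:

    # Hypothetical standalone script reproducing the question's data and the
    # unbounded-frame window sum shown above.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.master("local[*]").appName("partition-sum").getOrCreate()

    df = spark.createDataFrame(
        [(1, 2, 100), (1, 2, 150), (2, 1, 110), (2, 2, 200)],
        ["Category A", "Category B", "Value"],
    )

    # An unbounded frame makes every row of the partition contribute to the
    # aggregate, so each row carries the total of its (Category A, Category B) group.
    windowSpec = Window.partitionBy("Category A", "Category B")\
        .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    df.withColumn("sum", F.sum("Value").over(windowSpec))\
        .orderBy("Category A", "Category B")\
        .show()

    If the per-row window column is not needed, the same totals can also be computed with a plain aggregation, e.g. df.groupBy("Category A", "Category B").agg(F.sum("Value").alias("sum")), and joined back on the two category columns.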