I would like to sum over a partition, but instead of a cumulative sum I want the whole sum on each partition:
From:
Category A | Category B | Value |
---|---|---|
1 | 2 | 100 |
1 | 2 | 150 |
2 | 1 | 110 |
2 | 2 | 200 |
I would like to have:
Category A | Category B | Value | Sum |
---|---|---|---|
1 | 2 | 100 | 250 |
1 | 2 | 150 | 250 |
2 | 1 | 110 | 110 |
2 | 2 | 200 | 200 |
With:
from pyspark.sql.functions import sum
from pyspark.sql.window import Window
windowSpec = Window.partitionBy(["Category A","Category B"])
df = df.withColumn('sum', sum(df.Value).over(windowSpec))
I don't obtain the result I want; I obtain the cumulative sum:
Category A | Category B | Value | Sum |
---|---|---|---|
1 | 2 | 100 | 100 |
1 | 2 | 150 | 250 |
2 | 1 | 110 | 110 |
2 | 2 | 200 | 200 |
How could I proceed? Thank you.
When defining a window you can also specify its frame. Use the range (Window.unboundedPreceding, Window.unboundedFollowing) to sum over all rows within each partition, irrespective of the ordering of the rows:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy(["Category A","Category B"])\
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn('sum', F.sum(df.Value).over(windowSpec))\
    .orderBy("Category A", "Category B").show()
prints
+----------+----------+-----+-----+
|Category A|Category B|Value| sum|
+----------+----------+-----+-----+
| 1| 2| 100|250.0|
| 1| 2| 150|250.0|
| 2| 1| 110|110.0|
| 2| 2| 200|200.0|
+----------+----------+-----+-----+
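For reference, a minimal end-to-end sketch of the same approach, assuming a local SparkSession and the example data from the question (the `spark` session and column labels here are just illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed local session for the sketch
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Example data from the question
df = spark.createDataFrame(
    [(1, 2, 100), (1, 2, 150), (2, 1, 110), (2, 2, 200)],
    ["Category A", "Category B", "Value"],
)

# The frame spans the whole partition, so every row receives the partition total
windowSpec = (
    Window.partitionBy("Category A", "Category B")
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

df.withColumn("sum", F.sum("Value").over(windowSpec)) \
  .orderBy("Category A", "Category B") \
  .show()

The same totals could also be obtained with a groupBy aggregation joined back onto the original rows, but the window keeps it to a single pass and preserves the existing columns.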