Get the max value over the window in pyspark

I am trying to get the maximum value over a specific window in PySpark, but the result is not what I expected.

Here is my code:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window

test = spark.createDataFrame(pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [2, 3, 3, 4]}))
win = Window.partitionBy('grp').orderBy('val')
test = test.withColumn('row_number', F.row_number().over(win))
test = test.withColumn('max_row_number', F.max('row_number').over(win))

And the output is:

+---+---+----------+--------------+
|grp|val|row_number|max_row_number|
+---+---+----------+--------------+
|  a|  2|         1|             1|
|  a|  3|         2|             2|
|  b|  3|         1|             1|
|  b|  4|         2|             2|
+---+---+----------+--------------+

I expected max_row_number to be 2 for every row in both group "a" and group "b", but instead it just mirrors row_number.

Does anyone have ideas on this problem? Thanks a lot!


  • The problem here is the frame used by the max function. When you order the window, as you are doing, the default frame is Window.unboundedPreceding to Window.currentRow, so max becomes a running maximum rather than the maximum over the whole partition. You can define another window where you drop the ordering (the max function doesn't need it):

    w2 = Window.partitionBy('grp')

    You can see that in PySpark docs:

    Note: When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
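
    Putting it together, here is a minimal end-to-end sketch of the fix, assuming a local SparkSession (the session setup is mine; your environment may already provide `spark`). The ordered window `win` is kept for row_number(), while the unordered window `w2` gives max() the full partition as its frame:

    ```python
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    # Local session for a self-contained example; skip if `spark` already exists.
    spark = SparkSession.builder.master('local[1]').appName('window-max').getOrCreate()

    test = spark.createDataFrame(pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                                               'val': [2, 3, 3, 4]}))

    # Ordered window: required by row_number(), but it gives aggregate
    # functions a growing frame (unboundedPreceding -> currentRow).
    win = Window.partitionBy('grp').orderBy('val')

    # Unordered window: default frame spans the whole partition,
    # so max() sees every row in the group.
    w2 = Window.partitionBy('grp')

    test = (test
            .withColumn('row_number', F.row_number().over(win))
            .withColumn('max_row_number', F.max('row_number').over(w2)))

    test.show()
    ```

    With this change, max_row_number is 2 on every row of both groups. Alternatively, you can keep the ordered window and widen its frame explicitly with `win.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`.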