
Get the max value over a window in PySpark


I am computing the maximum value over a specific window in PySpark, but the result is not what I expected.

Here is my code:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.window import Window

test = spark.createDataFrame(pd.DataFrame({'grp': ['a', 'a', 'b', 'b'], 'val': [2, 3, 3, 4]}))
win = Window.partitionBy('grp').orderBy('val')
test = test.withColumn('row_number', F.row_number().over(win))
test = test.withColumn('max_row_number', F.max('row_number').over(win))
display(test)

And the output is:

grp  val  row_number  max_row_number
a    2    1           1
a    3    2           2
b    3    1           1
b    4    2           2

I expected max_row_number to be 2 for every row in group "a" and group "b", but it wasn't.

Does anyone have any ideas about this problem? Thanks a lot!


Solution

  • The problem here is the frame used by the max function. When the window is ordered, as yours is, the default frame is (Window.unboundedPreceding, Window.currentRow), so max returns a running maximum up to the current row rather than the maximum over the whole partition. You can define another window without the ordering (the max function doesn't need it); see also the explicit-frame sketch after the docs note below:

    w2 = Window.partitionBy('grp')
    test = test.withColumn('max_row_number', F.max('row_number').over(w2))
    

    You can see this in the PySpark docs:

    Note When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.
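
    If you'd rather keep the ordering on the window, you can instead widen the frame explicitly with rowsBetween. Here is a minimal sketch, assuming the test DataFrame, win window, and imports from the question above:

    # Keep the ordering but make the frame span the whole partition
    w3 = (Window.partitionBy('grp').orderBy('val')
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
    test = test.withColumn('max_row_number', F.max('row_number').over(w3))
    test.show()
    # Expected output (row order within each partition may vary):
    # +---+---+----------+--------------+
    # |grp|val|row_number|max_row_number|
    # +---+---+----------+--------------+
    # |  a|  2|         1|             2|
    # |  a|  3|         2|             2|
    # |  b|  3|         1|             2|
    # |  b|  4|         2|             2|
    # +---+---+----------+--------------+

    Both approaches give the same result; dropping the ordering is slightly simpler, while the explicit rowsBetween frame makes the intent visible when the ordering is still needed for other functions over the same window.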