Search code examples
pysparkwindow

Window object in pyspark


If i define a window object like this :

w_score = Window().partitionBy("HUB_ID").orderBy(F.col("coefficient").desc(), F.col("LastUpdateDate").desc())

And another one, like this :

w_score_prim = w_score.orderBy(F.col("Primary"))

The new object, w_score_prim is equivalent to which one ? :

Window().partitionBy("HUB_ID").orderBy(F.col("coefficient").desc(), F.col("LastUpdateDate").desc(), F.col("Primary"))
or 
Window().partitionBy("HUB_ID").orderBy(F.col("Primary"))

Solution

  • In multiple orderBy() call, last orderBy overwrites all previously called orderBy. So to answer your question - it would be

    Window().partitionBy("HUB_ID").orderBy(F.col("Primary"))
    

    It can be better illustrated using below 3 use cases:

    • case 1: If you want to order by column a (ascending) and then b (ascending) try this example.
    • case 2: If you call orderBy() twice (or multiple times) then it considers the last call as the final ordering call. So in this example it orders by column b 'only'.
    • case 3: Same (as case 2) happens here. It orders by column b and has ignored column a ordering.
    import pyspark.sql.functions as f
    from pyspark.sql.window import Window
    
    df = sc.parallelize([
        [1, 3, 1, 10],
        [1, 0, 2, 20],
        [3, 2, 3, 30],
        [2, 1, 6, 40],
        [2, 1, 4, 50],
        [2, 2, 5, 60]
    ]).toDF(('id', 'a', 'b', 'val'))
    
    #case 1
    w = Window.partitionBy().partitionBy("id").orderBy(f.col("a"), f.col("b"))
    df1 = df.withColumn("c", f.sum(f.col("val")).over(w))
    df1.show()
    
    #case 2
    w_2 = Window.partitionBy().partitionBy("id").orderBy(f.col("a")).orderBy(f.col("b"))
    df2 = df.withColumn("c", f.sum(f.col("val")).over(w_2))
    df2.show()
    
    #case 3
    w_3 = Window.partitionBy().partitionBy("id").orderBy(f.col("a"))
    w_3 = w_3.orderBy("b")
    df3 = df.withColumn("c", f.sum(f.col("val")).over(w_3))
    df3.show()
    

    Output is:

    +---+---+---+---+---+
    | id|  a|  b|val|  c|
    +---+---+---+---+---+
    |  1|  0|  2| 20| 20|
    |  1|  3|  1| 10| 30|
    |  3|  2|  3| 30| 30|
    |  2|  1|  4| 50| 50|
    |  2|  1|  6| 40| 90|
    |  2|  2|  5| 60|150|
    +---+---+---+---+---+
    
    +---+---+---+---+---+
    | id|  a|  b|val|  c|
    +---+---+---+---+---+
    |  1|  3|  1| 10| 10|
    |  1|  0|  2| 20| 30|
    |  3|  2|  3| 30| 30|
    |  2|  1|  4| 50| 50|
    |  2|  2|  5| 60|110|
    |  2|  1|  6| 40|150|
    +---+---+---+---+---+
    
    +---+---+---+---+---+
    | id|  a|  b|val|  c|
    +---+---+---+---+---+
    |  1|  3|  1| 10| 10|
    |  1|  0|  2| 20| 30|
    |  3|  2|  3| 30| 30|
    |  2|  1|  4| 50| 50|
    |  2|  2|  5| 60|110|
    |  2|  1|  6| 40|150|
    +---+---+---+---+---+