Search code examples
pysparkcountsql-function

Count of rows in pyspark dataframe over a window


I have a pyspark dataframe with below data

[The columns #, A, B and C is my data.

My code:

W = Window.partitionBy("A").orderBy(col("C"))
main_df = main_df.withColumn("cnt", F.count("B").over(W))

Is there something wrong in how I have used the count function? What can I do so the values in column 'Actual' match with 'Expecting'? I see two issues with my output -

  1. the count starts at 1 when it should start from 0
  2. for each group the last count is getting assigned instead of the first

Can anyone help me resolve this issue?


Solution

  • Try with dense_rank window function and substract 1 from the dense_rank value.

    Example:

    df = spark.createDataFrame([('0','king','4/30/2022'),('1','king','5/17/2022'),('2','king','5/26/2022'),('3','king','6/5/2022'),('4','king','7/8/2022'),('5','king','7/9/2022'),('6','king','4/21/2023'),('7','king','4/21/2023'),('8','king','4/22/2023'),('9','king','8/22/2023')],['#','A','C'])
    W = Window.partitionBy("A").orderBy(col("C"))
    
    df.withColumn("c",to_date(col("C"),"M/dd/yyyy")).\
      withColumn("Expecting",dense_rank().over(W)-1).show()
    #+---+----+----------+---------+
    #|  #|   A|         c|Expecting|
    #+---+----+----------+---------+
    #|  0|king|2022-04-30|        0|
    #|  1|king|2022-05-17|        1|
    #|  2|king|2022-05-26|        2|
    #|  3|king|2022-06-05|        3|
    #|  4|king|2022-07-08|        4|
    #|  5|king|2022-07-09|        5|
    #|  6|king|2023-04-21|        6|
    #|  7|king|2023-04-21|        6|
    #|  8|king|2023-04-22|        7|
    #|  9|king|2023-08-22|        8|
    #+---+----+----------+---------+