python, apache-spark, pyspark, row-number

Add sequential numbering based on consecutive groups in PySpark


I am working on code where I need to add a sequential number based on grouping by column A and column B. Below is the table/dataframe I have. The data is sorted by colA and Date.

colA colB Date
A 1 01-01-2014
A 1 01-02-2014
A 3 30-04-2014
A 3 05-05-2014
A 2 25-05-2014
A 1 06-06-2014
A 1 21-07-2014
B 1 04-09-2014
B 1 19-10-2014
B 1 03-12-2014
C 3 17-01-2015
C 2 03-03-2015
C 2 17-04-2015

Following is the expected result:

colA colB Date ROWNUM
A 1 01-01-2014 1
A 1 01-02-2014 2
A 3 30-04-2014 1
A 3 05-05-2014 2
A 2 25-05-2014 1
A 1 06-06-2014 1
A 1 21-07-2014 2
B 1 04-09-2014 1
B 1 19-10-2014 2
B 1 03-12-2014 3
C 3 17-01-2015 1
C 2 03-03-2015 1
C 2 17-04-2015 2

I am trying to use row_number here, but I get the result below, where for A (colA) the next occurrence of 1 (colB) gets row number 3 instead of restarting at 1:

colA colB Date ROWNUM
A 1 01-01-2014 1
A 1 01-02-2014 2
A 3 30-04-2014 1
A 3 05-05-2014 2
A 2 25-05-2014 1
A 1 06-06-2014 3
A 1 21-07-2014 4
B 1 04-09-2014 1
B 1 19-10-2014 2
B 1 03-12-2014 3
C 3 17-01-2015 1
C 2 03-03-2015 1
C 2 17-04-2015 2
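The attempt that produces this is presumably a window partitioned by colA and colB alone, something like `F.row_number().over(Window.partitionBy("colA", "colB").orderBy("Date"))` (an assumption; the question does not show the code). Partitioning by (colA, colB) merges the two non-adjacent runs of colB = 1 into one partition, which the following plain-Python sketch reproduces without needing a Spark session:

```python
rows = [  # (colA, colB, Date) for group A, already sorted by colA, Date
    ("A", 1, "2014-01-01"), ("A", 1, "2014-02-01"),
    ("A", 3, "2014-04-30"), ("A", 3, "2014-05-05"),
    ("A", 2, "2014-05-25"),
    ("A", 1, "2014-06-06"), ("A", 1, "2014-07-21"),
]

# Simulate row_number() over (partition by colA, colB order by Date):
# each (colA, colB) pair gets one counter, regardless of adjacency.
counts = {}
rownums = []
for colA, colB, date in rows:
    key = (colA, colB)
    counts[key] = counts.get(key, 0) + 1
    rownums.append(counts[key])

print(rownums)  # → [1, 2, 1, 2, 1, 3, 4]
```

The 2014-06-06 row lands in the same (A, 1) partition as the January rows, so it is numbered 3 rather than 1, matching the unwanted output above.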

Solution

  • This works by detecting where colB changes, turning those change flags into running group ids with a cumulative sum, and then restarting row_number within each (colA, group) partition:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Global ordering by colA and Date. There is no partitionBy here, so
    # Spark will warn that all rows move to a single partition; fine for
    # small data, costly at scale.
    w = Window.orderBy(F.asc("colA"), F.asc("Date"))

    df \
        .withColumn("eq", F.when(F.lag("colB").over(w) == F.col("colB"), 0).otherwise(1)) \
        .withColumn("groups", F.sum("eq").over(w)) \
        .withColumn("row_num", F.row_number().over(w.partitionBy(F.col("colA"), F.col("groups")))) \
        .show()
    

    Thanks to this answer: Window functions: PARTITION BY one column after ORDER BY another

    Input:

    +----+----+----------+
    |colA|colB|      Date|
    +----+----+----------+
    |   A|   1|2014-01-01|
    |   A|   1|2014-02-01|
    |   A|   3|2014-04-30|
    |   A|   3|2014-05-05|
    |   A|   2|2014-05-25|
    |   A|   1|2014-06-06|
    |   A|   1|2014-07-21|
    |   B|   1|2014-09-04|
    |   B|   1|2014-10-19|
    |   B|   1|2014-12-03|
    |   C|   3|2015-01-17|
    |   C|   2|2015-03-03|
    |   C|   2|2015-04-17|
    +----+----+----------+
    

    Output (keeping the intermediate columns for clarity; you can drop them):

    +----+----+----------+---+------+-------+
    |colA|colB|      Date| eq|groups|row_num|
    +----+----+----------+---+------+-------+
    |   A|   1|2014-01-01|  1|     1|      1|
    |   A|   1|2014-02-01|  0|     1|      2|
    |   A|   3|2014-04-30|  1|     2|      1|
    |   A|   3|2014-05-05|  0|     2|      2|
    |   A|   2|2014-05-25|  1|     3|      1|
    |   A|   1|2014-06-06|  1|     4|      1|
    |   A|   1|2014-07-21|  0|     4|      2|
    |   B|   1|2014-09-04|  0|     4|      1|
    |   B|   1|2014-10-19|  0|     4|      2|
    |   B|   1|2014-12-03|  0|     4|      3|
    |   C|   3|2015-01-17|  1|     5|      1|
    |   C|   2|2015-03-03|  1|     6|      1|
    |   C|   2|2015-04-17|  0|     6|      2|
    +----+----+----------+---+------+-------+
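For readers without a Spark session at hand, the same gaps-and-islands logic (lag flag → cumulative sum → per-group counter) can be sketched in plain Python; the variable names mirror the Spark columns:

```python
rows = [  # (colA, colB), already sorted by colA, Date
    ("A", 1), ("A", 1), ("A", 3), ("A", 3), ("A", 2), ("A", 1), ("A", 1),
    ("B", 1), ("B", 1), ("B", 1),
    ("C", 3), ("C", 2), ("C", 2),
]

result = []
prev_colB = None   # plays the role of F.lag("colB").over(w)
groups = 0         # running F.sum("eq").over(w)
row_counts = {}    # row_number state per (colA, groups) partition

for colA, colB in rows:
    eq = 0 if colB == prev_colB else 1  # 1 marks the start of a new run
    groups += eq                        # cumulative sum builds group ids
    key = (colA, groups)
    row_counts[key] = row_counts.get(key, 0) + 1
    result.append(row_counts[key])
    prev_colB = colB

print(result)  # → [1, 2, 1, 2, 1, 1, 2, 1, 2, 3, 1, 1, 2]
```

The output matches the expected ROWNUM column. Note that partitioning row_number by (colA, groups) rather than groups alone is what keeps B's rows numbered 1, 2, 3 even though they share group id 4 with A's final run.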