I am working on one code where I need to add sequential number as per the grouping on the basis of column A & column B. Below is the table/dataframe I have. The data is sorted by colA & Date.
colA | colB | Date |
---|---|---|
A | 1 | 01-01-2014 |
A | 1 | 01-02-2014 |
A | 3 | 30-04-2014 |
A | 3 | 05-05-2014 |
A | 2 | 25-05-2014 |
A | 1 | 06-06-2014 |
A | 1 | 21-07-2014 |
B | 1 | 04-09-2014 |
B | 1 | 19-10-2014 |
B | 1 | 03-12-2014 |
C | 3 | 17-01-2015 |
C | 2 | 03-03-2015 |
C | 2 | 17-04-2015 |
Following is the expected result:
colA | colB | Date | ROWNUM |
---|---|---|---|
A | 1 | 01-01-2014 | 1 |
A | 1 | 01-02-2014 | 2 |
A | 3 | 30-04-2014 | 1 |
A | 3 | 05-05-2014 | 2 |
A | 2 | 25-05-2014 | 1 |
A | 1 | 06-06-2014 | 1 |
A | 1 | 21-07-2014 | 2 |
B | 1 | 04-09-2014 | 1 |
B | 1 | 19-10-2014 | 2 |
B | 1 | 03-12-2014 | 3 |
C | 3 | 17-01-2015 | 1 |
C | 2 | 03-03-2015 | 1 |
C | 2 | 17-04-2015 | 2 |
I am trying to use row_number here and getting the below result where for A(colA) the next occurrence of 1(colB) has rownumber added as 3 instead of 1:
colA | colB | Date | ROWNUM |
---|---|---|---|
A | 1 | 01-01-2014 | 1 |
A | 1 | 01-02-2014 | 2 |
A | 3 | 30-04-2014 | 1 |
A | 3 | 05-05-2014 | 2 |
A | 2 | 25-05-2014 | 1 |
A | 1 | 06-06-2014 | 3 |
A | 1 | 21-07-2014 | 4 |
B | 1 | 04-09-2014 | 1 |
B | 1 | 19-10-2014 | 2 |
B | 1 | 03-12-2014 | 3 |
C | 3 | 17-01-2015 | 1 |
C | 2 | 03-03-2015 | 1 |
C | 2 | 17-04-2015 | 2 |
This would work:
w=Window.orderBy(F.asc("colA"), F.asc("Date"))
df\
.withColumn("eq", F.when(F.lag("colB").over(w)==F.col("colB"), 0).otherwise(1))\
.withColumn("groups", F.sum("eq").over(w))\
.withColumn("row_num", F.row_number().over(w.partitionBy(F.col("colA"), F.col("groups"))))\
.show()
Thanks to this answer: Window functions: PARTITION BY one column after ORDER BY another
Input:
+----+----+----------+
|colA|colB| Date|
+----+----+----------+
| A| 1|2014-01-01|
| A| 1|2014-02-01|
| A| 3|2014-04-30|
| A| 3|2014-05-05|
| A| 2|2014-05-25|
| A| 1|2014-06-06|
| A| 1|2014-07-21|
| B| 1|2014-09-04|
| B| 1|2014-10-19|
| B| 1|2014-12-03|
| C| 3|2015-01-17|
| C| 2|2015-03-03|
| C| 2|2015-04-17|
+----+----+----------+
Output (Keeping the intermediate columns for understanding, you can drop them):
+----+----+----------+---+------+-------+
|colA|colB| Date| eq|groups|row_num|
+----+----+----------+---+------+-------+
| A| 1|2014-01-01| 1| 1| 1|
| A| 1|2014-02-01| 0| 1| 2|
| A| 3|2014-04-30| 1| 2| 1|
| A| 3|2014-05-05| 0| 2| 2|
| A| 2|2014-05-25| 1| 3| 1|
| A| 1|2014-06-06| 1| 4| 1|
| A| 1|2014-07-21| 0| 4| 2|
| B| 1|2014-09-04| 0| 4| 1|
| B| 1|2014-10-19| 0| 4| 2|
| B| 1|2014-12-03| 0| 4| 3|
| C| 3|2015-01-17| 1| 5| 1|
| C| 2|2015-03-03| 1| 6| 1|
| C| 2|2015-04-17| 0| 6| 2|
+----+----+----------+---+------+-------+