I am using PySpark and I want to get the first status ordered by date, but only one per run of consecutive statuses: a status can appear more than once overall, just not more than once in a row. Here is an example of what I have:
status | created_when | GP |
---|---|---|
A | 2022-10-10 | A1 |
B | 2022-10-12 | A1 |
B | 2022-10-13 | A1 |
C | 2022-10-13 | A1 |
C | 2022-10-14 | A1 |
B | 2022-12-15 | A1 |
C | 2022-12-16 | A1 |
D | 2022-12-17 | A1 |
A | 2022-12-18 | A1 |
This is what I need:
status | created_when | GP |
---|---|---|
A | 2022-10-10 | A1 |
B | 2022-10-12 | A1 |
C | 2022-10-13 | A1 |
B | 2022-12-15 | A1 |
C | 2022-12-16 | A1 |
D | 2022-12-17 | A1 |
A | 2022-12-18 | A1 |
I think it is something like the following, but I don't know how to implement it either:
when(row[status] == row[status] + 1) then row[status]
Thank you for your help.
I just found the answer to the problem. I realized I had to use another column (GP) for the partitionBy.
from pyspark.sql import Window
from pyspark.sql import functions as F

# Partition by GP and order by date so lag() returns the previous row's status
w = Window.partitionBy("GP").orderBy("created_when")

df_1 = df_0.withColumn("lag", F.lag("status").over(w)) \
    .where((F.col("status") != F.col("lag")) | (F.col("lag").isNull()))
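The window logic above — partition by GP, order by created_when, and keep a row only when its status differs from the previous row's (or there is no previous row) — can be sanity-checked in plain Python before running it on a cluster. This is an illustrative sketch of the same logic, not Spark code; the function name `drop_consecutive_status` is my own:

```python
from operator import itemgetter

rows = [
    ("A", "2022-10-10", "A1"),
    ("B", "2022-10-12", "A1"),
    ("B", "2022-10-13", "A1"),
    ("C", "2022-10-13", "A1"),
    ("C", "2022-10-14", "A1"),
    ("B", "2022-12-15", "A1"),
    ("C", "2022-12-16", "A1"),
    ("D", "2022-12-17", "A1"),
    ("A", "2022-12-18", "A1"),
]

def drop_consecutive_status(rows):
    """Keep a row only if its status differs from the previous row's
    status within the same GP, ordered by created_when (mirrors
    F.lag("status").over(Window.partitionBy("GP").orderBy("created_when")))."""
    result = []
    last_status = {}  # per-GP "lag" value
    # sorted() is stable, so ties on created_when keep their input order
    for status, created_when, gp in sorted(rows, key=itemgetter(2, 1)):
        if last_status.get(gp) != status:  # lag is null or different
            result.append((status, created_when, gp))
        last_status[gp] = status
    return result

for row in drop_consecutive_status(rows):
    print(row)
```

Running this on the sample data yields the seven rows from the expected table: the duplicate B on 2022-10-13 and the duplicate C on 2022-10-14 are dropped, while the later B/C/A rows survive because they are not consecutive repeats.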