
Get first value from consecutive rows PySpark


I am using PySpark and I want to keep only the first status (ordered by date) whenever the same status appears in consecutive rows. A status can occur more than once overall, but should not appear more than once in a row. Here is an example of what I have:

status created_when GP
A 2022-10-10 A1
B 2022-10-12 A1
B 2022-10-13 A1
C 2022-10-13 A1
C 2022-10-14 A1
B 2022-12-15 A1
C 2022-12-16 A1
D 2022-12-17 A1
A 2022-12-18 A1

This is what I need:

status created_when GP
A 2022-10-10 A1
B 2022-10-12 A1
C 2022-10-13 A1
B 2022-12-15 A1
C 2022-12-16 A1
D 2022-12-17 A1
A 2022-12-18 A1

I think I need something like the following, but I don't know how to implement it:

when(status of current row == status of previous row) then drop the current row

Thank you for your help


Solution

  • I just found the answer to the problem. I realized I had to add another column to do the partitionBy.

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Within each GP, order by date and compare each status to the previous one
    w = Window.partitionBy("GP").orderBy("created_when")
    df_1 = df_0.withColumn("lag", F.lag("status").over(w)) \
               .where((F.col("status") != F.col("lag")) | (F.col("lag").isNull()))