I have a dataframe in PySpark:
id    time        replace
3241  2024-01-31  false
4344  2019-09-01  true
5775  2022-02-01  false
5394  2018-06-16  true
7645  2023-03-11  false
I want to find the rows where replace == true and replace their time with the nearest time among the rows where replace == false. I want to do this for every row where replace == true.
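In case it helps to reproduce, the dataframe can be built roughly like this (assuming time is a date column and replace a boolean; that part is my assumption):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.getOrCreate()

# sample data from the table above; the column types are an assumption
df = spark.createDataFrame(
    [(3241, "2024-01-31", False),
     (4344, "2019-09-01", True),
     (5775, "2022-02-01", False),
     (5394, "2018-06-16", True),
     (7645, "2023-03-11", False)],
    ["id", "time", "replace"],
).withColumn("time", to_date("time"))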
You can solve this with window functions. First build a temporary group id as a running count, ordered by time, of the rows where replace == false; that splits the data so every group ends right before the next non-replaced row. The lead of the last row in a group is therefore the time of that next non-replaced row, and spreading it over the group with last gives the value to write into the replaced rows (i.e. each replaced row takes the time of the first non-replaced row that follows it):
from pyspark.sql import Window
from pyspark.sql.functions import col, last, lead, sum, when

w = Window.orderBy("time")
w2 = (Window.partitionBy("tmp").orderBy("time")
      .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = (
    df.withColumn("tmp", when(col("replace"), 0).otherwise(1))   # 1 marks the non-replaced rows
      .withColumn("tmp", sum("tmp").over(w))                     # running count = group id
      .withColumn("next_time", lead("time").over(w))             # time of the following row
      .withColumn("time", when(col("replace"), last("next_time").over(w2)).otherwise(col("time")))
      .drop("tmp", "next_time")
)
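With the sample data, both replaced rows pick up 2022-02-01 (the first non-replaced time after them) and every other row keeps its original value; result.orderBy("time").show() prints something like:

+----+----------+-------+
|  id|      time|replace|
+----+----------+-------+
|5394|2022-02-01|   true|
|4344|2022-02-01|   true|
|5775|2022-02-01|  false|
|7645|2023-03-11|  false|
|3241|2024-01-31|  false|
+----+----------+-------+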