
Replace rows with nearest time using PySpark


I have a dataframe in PySpark:

id    time         replace
3241  2024-01-31   false
4344  2019-09-01   true
5775  2022-02-01   false
5394  2018-06-16   true
7645  2023-03-11   false

For every row where replace == true, I want to overwrite the time column with the nearest time taken from the rows where replace == false.


Solution

  • You can solve it with a combination of window functions: flag each row, build a group id with a running sum ordered by "time", then partition by that group id. Note that this assigns each replace == true row the time of the next replace == false row in chronological order (true rows that come after the last false row get null):

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag, last, sum as sum_, when

    # Running window over the whole dataframe, ordered by time
    w = Window.orderBy("time")
    # One partition per group id built below
    w2 = Window.partitionBy("tmp")

    df = (
        df.withColumn("tmp", when(col("replace"), 0).otherwise(1))
        # Running sum: each replace == false row starts a new group, so the
        # replace == true rows preceding a given false row share one group
        .withColumn("tmp", sum_("tmp").over(w))
        # lag("time", -1) looks one row ahead; last(...).over(w2) propagates
        # the group's final value, i.e. the time of the next false row,
        # to every true row in the group
        .withColumn(
            "time",
            when(col("replace"), last(lag("time", -1).over(w)).over(w2))
            .otherwise(col("time")),
        )
        .drop("tmp")
    )
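
    To see what the two windows compute without spinning up Spark, the same logic can be sketched in plain Python (the function name and the forward scan are just an illustration, not part of the Spark answer): sort by time, then give each replace == true row the time of the next replace == false row.

    ```python
    from datetime import date

    # Sample rows mirroring the question's data: (id, time, replace)
    rows = [
        (3241, date(2024, 1, 31), False),
        (4344, date(2019, 9, 1), True),
        (5775, date(2022, 2, 1), False),
        (5394, date(2018, 6, 16), True),
        (7645, date(2023, 3, 11), False),
    ]

    def replace_with_next_false_time(rows):
        """Pure-Python sketch of the window logic: sort by time, then give
        each replace==True row the time of the next replace==False row."""
        ordered = sorted(rows, key=lambda r: r[1])
        out = []
        for i, (rid, t, rep) in enumerate(ordered):
            if rep:
                # Scan forward for the next replace==False row; the running
                # sum in the Spark version achieves the same grouping.
                nxt = next((t2 for _, t2, r2 in ordered[i + 1:] if not r2), None)
                out.append((rid, nxt, rep))
            else:
                out.append((rid, t, rep))
        return out
    ```

    Running this on the sample data, both true rows (ids 5394 and 4344) receive 2022-02-01, the time of the earliest false row that follows them.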