
Replace rows with nearest time using PySpark


I have a dataframe in PySpark:

id    time         replace
3241  2024-01-31   false
4344  2019-09-01   true
5775  2022-02-01   false
5394  2018-06-16   true
7645  2023-03-11   false

For every row where replace == true, I want to overwrite the time column with the nearest time taken from the rows where replace == false.


Solution

  • You can solve it with a combination of window functions: flag each row, build a group id with a running sum ordered by "time", then partition by that group id. Note that this assigns each replace == true row the time of the next replace == false row in chronological order (true rows that come after the last false row get null):

    from pyspark.sql import Window
    from pyspark.sql.functions import col, lag, last, sum as sum_, when

    # Running window over the whole dataframe, ordered by time
    w = Window.orderBy("time")
    # One partition per group id built below
    w2 = Window.partitionBy("tmp")

    df = (
        df.withColumn("tmp", when(col("replace"), 0).otherwise(1))
        # Running sum: each replace == false row starts a new group, so the
        # replace == true rows preceding a given false row share one group
        .withColumn("tmp", sum_("tmp").over(w))
        # lag("time", -1) looks one row ahead; last(...).over(w2) propagates
        # the group's final value, i.e. the time of the next false row,
        # to every true row in the group
        .withColumn(
            "time",
            when(col("replace"), last(lag("time", -1).over(w)).over(w2))
            .otherwise(col("time")),
        )
        .drop("tmp")
    )
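
    To see what the two windows compute without spinning up Spark, the same logic can be sketched in plain Python (the function name and the forward scan are just an illustration, not part of the Spark answer): sort by time, then give each replace == true row the time of the next replace == false row.

    ```python
    from datetime import date

    # Sample rows mirroring the question's data: (id, time, replace)
    rows = [
        (3241, date(2024, 1, 31), False),
        (4344, date(2019, 9, 1), True),
        (5775, date(2022, 2, 1), False),
        (5394, date(2018, 6, 16), True),
        (7645, date(2023, 3, 11), False),
    ]

    def replace_with_next_false_time(rows):
        """Pure-Python sketch of the window logic: sort by time, then give
        each replace==True row the time of the next replace==False row."""
        ordered = sorted(rows, key=lambda r: r[1])
        out = []
        for i, (rid, t, rep) in enumerate(ordered):
            if rep:
                # Scan forward for the next replace==False row; the running
                # sum in the Spark version achieves the same grouping.
                nxt = next((t2 for _, t2, r2 in ordered[i + 1:] if not r2), None)
                out.append((rid, nxt, rep))
            else:
                out.append((rid, t, rep))
        return out
    ```

    Running this on the sample data, both true rows (ids 5394 and 4344) receive 2022-02-01, the time of the earliest false row that follows them.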