I have a DataFrame in Spark, sorted by time, like this:
id  value  time
1   a      2024-01-01 00:00:00
2   a      2024-01-01 00:00:00
2   b      2024-01-03 00:00:00
1   b      2024-01-03 01:00:00
2   c      2024-01-03 02:00:00
I want to deduplicate the DataFrame so that, for each id, only the latest record is kept. Like this:
id  value  time
1   b      2024-01-03 01:00:00
2   c      2024-01-03 02:00:00
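For reproducibility, here is a minimal sketch that builds the sample DataFrame above; the SparkSession setup and the cast of time to a timestamp column are my assumptions:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Same rows as the input table above; the time strings are cast to a
# proper timestamp so that ordering by time behaves as expected.
df = spark.createDataFrame(
    [
        (1, "a", "2024-01-01 00:00:00"),
        (2, "a", "2024-01-01 00:00:00"),
        (2, "b", "2024-01-03 00:00:00"),
        (1, "b", "2024-01-03 01:00:00"),
        (2, "c", "2024-01-03 02:00:00"),
    ],
    ["id", "value", "time"],
).withColumn("time", f.col("time").cast("timestamp"))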
I tried df.dropDuplicates(['id']) to achieve this (note that dropDuplicates expects a list of column names, not a bare string), but it doesn't seem to meet my requirement.
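Concretely, this is a sketch of what I ran:

df.dropDuplicates(["id"]).show()
# Exactly one row per id survives, but on my data the surviving row
# is not always the latest one by time.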
What rule does dropDuplicates use to choose which record to keep? Does it always keep the latest record, or the earliest? Or does it just pick one record arbitrarily?
Would a Window function work instead? For example, like this:
from pyspark.sql import Window
import pyspark.sql.functions as f

# Number the rows within each id, newest first, then keep only row 1.
w1 = Window.partitionBy("id").orderBy(f.col("time").desc())
df.withColumn("row_num", f.row_number().over(w1)) \
    .filter(f.col("row_num") == 1) \
    .drop("row_num") \
    .show()
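If I understand row_number correctly, on the sample data above this should print exactly the desired result:

+---+-----+-------------------+
| id|value|               time|
+---+-----+-------------------+
|  1|    b|2024-01-03 01:00:00|
|  2|    c|2024-01-03 02:00:00|
+---+-----+-------------------+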