apache-spark, pyspark, apache-spark-sql

Could DataFrame.dropDuplicates be used to keep only the latest data in Spark?


I have a DataFrame in Spark, sorted by time, like this:

id     value     time
1      a         2024-01-01 00:00:00
2      a         2024-01-01 00:00:00
2      b         2024-01-03 00:00:00
1      b         2024-01-03 01:00:00
2      c         2024-01-03 02:00:00

And I want to deduplicate the DataFrame so that, for each id, only the latest record is kept. Like this:

id     value     time
1      b         2024-01-03 01:00:00
2      c         2024-01-03 02:00:00

I tried to use dataframe.dropDuplicates(['id']) to achieve this, but it doesn't seem to meet my requirement.
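
For reference, here is a minimal way to reproduce what I'm seeing (assuming a local SparkSession; the builder call is just to make the example self-contained):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # Any existing SparkSession works; this is only for a self-contained example
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [
            (1, "a", "2024-01-01 00:00:00"),
            (2, "a", "2024-01-01 00:00:00"),
            (2, "b", "2024-01-03 00:00:00"),
            (1, "b", "2024-01-03 01:00:00"),
            (2, "c", "2024-01-03 02:00:00"),
        ],
        ["id", "value", "time"],
    ).withColumn("time", f.to_timestamp("time"))

    # Keeps one row per id, but which row is kept is not guaranteed
    df.dropDuplicates(["id"]).show()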

What rule does dropDuplicates use to choose which record to keep? Does it always keep the latest record, or the earliest record? Or does it just choose one record arbitrarily?


Solution

  • What about using a window function? For example, like this:

    from pyspark.sql import Window
    import pyspark.sql.functions as f

    # Rank the rows within each id, newest time first
    w1 = Window.partitionBy("id").orderBy(f.col("time").desc())

    # Keep only the top-ranked (i.e. latest) row per id
    df.withColumn("row_num", f.row_number().over(w1)) \
      .filter(f.col("row_num") == 1) \
      .drop("row_num") \
      .show()
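
    Running this on the sample data should give exactly the desired output:

    id     value     time
    1      b         2024-01-03 01:00:00
    2      c         2024-01-03 02:00:00

    Unlike dropDuplicates, row_number() over a window ordered by time descending deterministically keeps the latest row per id. If two rows share the same time, the tie is still broken arbitrarily unless you add another column to the window's ordering.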