Tags: dataframe, pyspark, duplicates

How to remove all copies of duplicates from a PySpark DataFrame


In this PySpark DataFrame, some records are duplicated on the ID field.

ID  start_dt
1   2020-02-09
1   2021-02-15
2   2022-05-04
3   2023-05-15

I want to remove all records whose ID is duplicated. In this case, all rows with ID equal to 1 should be removed.

ID  start_dt
2   2022-05-04
3   2023-05-15

How do I do this in PySpark?


Solution

  • You can group by ID, keep only the IDs whose count is 1, and then join this filtered set back to the original DataFrame to get the desired output:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    
    spark = SparkSession.builder.getOrCreate()
    
    data = [(1, "2020-02-09"), (1, "2021-02-15"), (2, "2022-05-04"), (3, "2023-05-15")]
    columns = ["ID", "start_dt"]
    df = spark.createDataFrame(data, columns)
    
    # Keep only the IDs that appear exactly once, then inner-join back to the
    # original DataFrame so that every row of a duplicated ID is dropped
    df.join(
        df.groupBy("ID").count().filter(col("count") == 1).select("ID"), "ID", "inner"
    ).show()
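
  • Alternatively, here is a minimal sketch (my addition, not part of the original answer) that avoids the self-join: count the rows per ID with a window function and keep only the rows whose partition count is 1

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count
    from pyspark.sql.window import Window
    
    spark = SparkSession.builder.getOrCreate()
    
    data = [(1, "2020-02-09"), (1, "2021-02-15"), (2, "2022-05-04"), (3, "2023-05-15")]
    df = spark.createDataFrame(data, ["ID", "start_dt"])
    
    # Count the rows in each ID partition; rows of a duplicated ID get cnt > 1
    # and are filtered out, then the helper column is dropped
    w = Window.partitionBy("ID")
    df.withColumn("cnt", count("*").over(w)).filter(col("cnt") == 1).drop("cnt").show()

  Both approaches keep only IDs 2 and 3, matching the desired output. Note that df.dropDuplicates(["ID"]) is not equivalent: it keeps one row per duplicated ID instead of removing them all.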