In this PySpark DataFrame, some records are duplicated on the ID field.
| ID | start_dt   |
|----|------------|
| 1  | 2020-02-09 |
| 1  | 2021-02-15 |
| 2  | 2022-05-04 |
| 3  | 2023-05-15 |
I want to remove every record whose ID appears more than once. In this case, all rows with ID equal to 1 should be removed, leaving:
| ID | start_dt   |
|----|------------|
| 2  | 2022-05-04 |
| 3  | 2023-05-15 |
How do I do this in PySpark?
You can group by ID, keep only the IDs whose count equals 1, and then join those IDs back to the original DataFrame to get the desired output:
```python
from pyspark.sql.functions import col

data = [(1, "2020-02-09"), (1, "2021-02-15"), (2, "2022-05-04"), (3, "2023-05-15")]
columns = ["ID", "start_dt"]
df = spark.createDataFrame(data, columns)

# IDs that occur exactly once (i.e. have no duplicates)
unique_ids = df.groupBy("ID").count().filter(col("count") == 1).select("ID")

# Inner join keeps only the rows whose ID is in unique_ids
df.join(unique_ids, "ID", "inner").show()
```