I have a dataframe which contains both duplicate and distinct records in it. I have to identify which are the duplicate records and which are distinct records and split them separately in 2 different dataframes.
Input:
custid | cust_name | loc | prodid
1234 | John | US | P133
1234 | John | US | P133
1234 | John | US | P133
5678 | Mike | CHN | P456
4325 | Peter | RUS | P247
3458 | Andy | IND | P764
3458 | Andy | IND | P764
Ouput: DF 1 (Dups):
custid | cust_name | loc | prodid
1234 | John | US | P133
1234 | John | US | P133
1234 | John | US | P133
3458 | Andy | IND | P764
3458 | Andy | IND | P764
DF2 (Non Dups):
custid | cust_name | loc | prodid
5678 | Mike | CHN | P456
4325 | Peter | RUS | P247
Can someone please help.
Create a window specification to count the number of rows in each group then check the rows where count
is greater than 1 to create a boolean flag has_dupes
then filter
the subsets using this flag
W = Window.partitionBy(*df.columns)
df = df.withColumn('has_dupes', F.count(F.lit(1)).over(W) > 1)
df_dupes = df.filter('has_dupes')
df_nodupes = df.filter('not has_dupes')
df_nodupes.show()
+------+---------+---+------+---------+
|custid|cust_name|loc|prodid|has_dupes|
+------+---------+---+------+---------+
| 4325| Peter|RUS| P247| false|
| 5678| Mike|CHN| P456| false|
+------+---------+---+------+---------+
df_dupes.show()
+------+---------+---+------+---------+
|custid|cust_name|loc|prodid|has_dupes|
+------+---------+---+------+---------+
| 1234| John| US| P133| true|
| 1234| John| US| P133| true|
| 1234| John| US| P133| true|
| 3458| Andy|IND| P764| true|
| 3458| Andy|IND| P764| true|
+------+---------+---+------+---------+