Tags: python, dataframe, apache-spark, pyspark

Identify Duplicate and Non-Dup records in a dataframe


I have a dataframe that contains both duplicate and distinct records. I need to identify which records are duplicates and which are distinct, and split them into two separate dataframes.

Input:

custid | cust_name | loc | prodid
1234   | John      | US  | P133
1234   | John      | US  | P133
1234   | John      | US  | P133
5678   | Mike      | CHN | P456
4325   | Peter     | RUS | P247
3458   | Andy      | IND | P764 
3458   | Andy      | IND | P764
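
For reference, the sample data above can be recreated with something like this (a minimal sketch, assuming an active SparkSession named spark):

    df = spark.createDataFrame(
        [
            (1234, 'John', 'US', 'P133'),
            (1234, 'John', 'US', 'P133'),
            (1234, 'John', 'US', 'P133'),
            (5678, 'Mike', 'CHN', 'P456'),
            (4325, 'Peter', 'RUS', 'P247'),
            (3458, 'Andy', 'IND', 'P764'),
            (3458, 'Andy', 'IND', 'P764'),
        ],
        ['custid', 'cust_name', 'loc', 'prodid'],
    )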

Output: DF 1 (Dups):

custid | cust_name | loc | prodid
1234   | John      | US  | P133
1234   | John      | US  | P133
1234   | John      | US  | P133
3458   | Andy      | IND | P764 
3458   | Andy      | IND | P764

DF2 (Non Dups):

custid | cust_name | loc | prodid
5678   | Mike      | CHN | P456
4325   | Peter     | RUS | P247

Can someone please help?


Solution

  • Create a window specification over all of the columns to count the number of rows in each group, flag rows where that count is greater than 1 with a boolean column has_dupes, then filter the two subsets using this flag:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Partition by every column so identical rows land in the same group
    W = Window.partitionBy(*df.columns)
    # A row is a duplicate if its group contains more than one row
    df = df.withColumn('has_dupes', F.count(F.lit(1)).over(W) > 1)

    df_dupes = df.filter('has_dupes')
    df_nodupes = df.filter('not has_dupes')
    

    df_nodupes.show()
    +------+---------+---+------+---------+
    |custid|cust_name|loc|prodid|has_dupes|
    +------+---------+---+------+---------+
    |  4325|    Peter|RUS|  P247|    false|
    |  5678|     Mike|CHN|  P456|    false|
    +------+---------+---+------+---------+
    
    df_dupes.show()
    +------+---------+---+------+---------+
    |custid|cust_name|loc|prodid|has_dupes|
    +------+---------+---+------+---------+
    |  1234|     John| US|  P133|     true|
    |  1234|     John| US|  P133|     true|
    |  1234|     John| US|  P133|     true|
    |  3458|     Andy|IND|  P764|     true|
    |  3458|     Andy|IND|  P764|     true|
    +------+---------+---+------+---------+
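
    The outputs above still carry the helper column; if the final dataframes should match the requested format exactly, it can be dropped after filtering (assuming the dataframes from the snippet above):

    df_dupes = df_dupes.drop('has_dupes')
    df_nodupes = df_nodupes.drop('has_dupes')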