Tags: python, dataframe, apache-spark, pyspark

Identify Duplicate and Non-Dup records in a dataframe


I have a dataframe that contains both duplicate and distinct records. I need to identify which records are duplicates and which are distinct, and split them into two separate dataframes.

Input:

custid | cust_name | loc | prodid
1234   | John      | US  | P133
1234   | John      | US  | P133
1234   | John      | US  | P133
5678   | Mike      | CHN | P456
4325   | Peter     | RUS | P247
3458   | Andy      | IND | P764 
3458   | Andy      | IND | P764
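
For reference, the sample data above can be recreated with something like this (a minimal sketch, assuming an active SparkSession named spark):

    df = spark.createDataFrame(
        [
            (1234, 'John', 'US', 'P133'),
            (1234, 'John', 'US', 'P133'),
            (1234, 'John', 'US', 'P133'),
            (5678, 'Mike', 'CHN', 'P456'),
            (4325, 'Peter', 'RUS', 'P247'),
            (3458, 'Andy', 'IND', 'P764'),
            (3458, 'Andy', 'IND', 'P764'),
        ],
        ['custid', 'cust_name', 'loc', 'prodid'],
    )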

Output: DF 1 (Dups):

custid | cust_name | loc | prodid
1234   | John      | US  | P133
1234   | John      | US  | P133
1234   | John      | US  | P133
3458   | Andy      | IND | P764 
3458   | Andy      | IND | P764

DF2 (Non Dups):

custid | cust_name | loc | prodid
5678   | Mike      | CHN | P456
4325   | Peter     | RUS | P247

Can someone please help?


Solution

  • Create a window specification over all of the columns to count the number of rows in each group, flag rows where that count is greater than 1 with a boolean column has_dupes, then filter the two subsets using this flag:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Partition by every column so identical rows land in the same group
    W = Window.partitionBy(*df.columns)
    # A row is a duplicate if its group contains more than one row
    df = df.withColumn('has_dupes', F.count(F.lit(1)).over(W) > 1)

    df_dupes = df.filter('has_dupes')
    df_nodupes = df.filter('not has_dupes')
    

    df_nodupes.show()
    +------+---------+---+------+---------+
    |custid|cust_name|loc|prodid|has_dupes|
    +------+---------+---+------+---------+
    |  4325|    Peter|RUS|  P247|    false|
    |  5678|     Mike|CHN|  P456|    false|
    +------+---------+---+------+---------+
    
    df_dupes.show()
    +------+---------+---+------+---------+
    |custid|cust_name|loc|prodid|has_dupes|
    +------+---------+---+------+---------+
    |  1234|     John| US|  P133|     true|
    |  1234|     John| US|  P133|     true|
    |  1234|     John| US|  P133|     true|
    |  3458|     Andy|IND|  P764|     true|
    |  3458|     Andy|IND|  P764|     true|
    +------+---------+---+------+---------+
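
    The outputs above still carry the helper column; if the final dataframes should match the requested format exactly, it can be dropped after filtering (assuming the dataframes from the snippet above):

    df_dupes = df_dupes.drop('has_dupes')
    df_nodupes = df_nodupes.drop('has_dupes')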