Search code examples
pythonpandaspython-itertools

Find duplicates between 2 columns (independent order) , count and drop Python


I'm trying to find the duplicates between 2 columns, were order is independent, but I need to keep the count of duplicates after dropping them

df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
              columns=['source', 'target'],
              )

This is my expected result

    source  target   count
0     A       B        2
1     D       B        1
3     B       C        2

I've already tried several approaches, but I can't come close to a solution.

It does not matter which combination is maintained. In the result example I kept the first.


Solution

  • The following approach creates a new column containing a set of the values in the columns specified. The advantage is that all other columns are preserved in the final result. Furthermore, the indices are preserved the same way as in the expected output you posted:

    df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'],)
    
    # Create column with set of both columns
    df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
    
    # Create count column based on new tmp column
    df['count'] = df.groupby(['tmp'])['target'].transform('size')
    
    # Drop duplicate rows based on new tmp column
    df = df[~df.duplicated(subset='tmp', keep='first')]
    
    # Remove tmp column
    df = df.drop('tmp', 1)
    
    df
    

    Output:

        source  target  count
    0   A   B   2
    1   D   B   1
    3   B   C   2