Search code examples
pythonpandasdrop-duplicates

Dropping duplicate rows in a Pandas DataFrame based on multiple column values


In a dataframe I need to drop/filter out duplicate rows based the combined columns A and B. In the example DataFrame

    A   B   C   D
0   1   1   3   9
1   1   2   4   8
2   1   3   5   7
3   1   3   4   6
4   1   4   5   5
5   1   4   6   4

rows 2 and 3, and 5 and 6 are duplicates and one of them should be dropped, keeping the row with the lowest value of

2 * C + 3 * D

To do this, I created a new temporary score column, S

df['S'] = 2 * df['C'] + 3 * df['D']

and finally to return the index of the minimum value for S

df.loc[df.groupby(['A', 'B'])['S'].idxmin()]
del ['S']

The result is

    A   B   C   D
0   1   1   3   9
1   1   2   4   8
3   1   3   4   6
5   1   4   6   4

However, is there a more efficient way of doing this, without having to add (and later drop) a new column?


Solution

  • Here is a way by using sort_values() and duplicated()

    df.loc[~df.sort_values('A',key = lambda x: df['C'].mul(2).add(df['D'].mul(3))).duplicated(subset = ['A','B'])]
    

    Output:

       A  B  C  D
    0  1  1  3  9
    1  1  2  4  8
    3  1  3  4  6
    5  1  4  6  4