Search code examples
pythonpandasdrop-duplicates

Pandas - Opposite of drop duplicates, keep first


I'm familiar with how to drop duplicate rows, and then using the parameter of first, last, none. Nothing too complicated with that and there's plenty of examples (ie here).

However, what I'm looking for is there a way to find the duplicates, but instead of dropping all duplicates and keeping the first, if I have duplicates, keep all duplicates but drop the first:

So instead of "drop if duplicates, keep the first", I want "keep if duplicates, drop first"

Example:

Given this dataframe, and looking at duplicates in cost column:

    ID name type cost
0    0    a   bb    1
1    1    a   cc    2 <--- there are duplicates, so drop this row
2  1_0    a   dd    2
3    2    a   ee    3 <--- there are duplicates, so drop this row
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

If there are duplicates in the cost column, just drop the first occurrence, but keep the rest.

So my output would be:

    ID name type cost
0    0    a   bb    1
2  1_0    a   dd    2
4  2_0    a   ff    3
5  2_1    a   gg    3
6  2_2    a   hh    3

Heres the sample dataframe:

import pandas as pd

df = pd.DataFrame([
['0',   'a',    'bb',   '1'],
['1',   'a',    'cc',   '2'],
['1_0', 'a',    'dd',   '2'],
['2',   'a',    'ee',   '3'],
['2_0', 'a',    'ff',   '3'],
['2_1', 'a',    'gg',   '3'],
['2_2', 'a',    'hh',   '3']], columns = ['ID', 'name', 'type', 'cost'])

Solution

  • You can chain 2 masks created by DataFrame.duplicated with bitwise OR and filter by boolean indexing:

    df = df[df.duplicated('cost') | ~df.duplicated('cost', keep=False)]
    print (df)
        ID name type cost
    0    0    a   bb    1
    2  1_0    a   dd    2
    4  2_0    a   ff    3
    5  2_1    a   gg    3
    6  2_2    a   hh    3
    

    Detail:

    print (df.assign(mask1=df.duplicated('cost'), mask2=~df.duplicated('cost', keep=False)))
        ID name type cost  mask1  mask2
    0    0    a   bb    1  False   True
    1    1    a   cc    2  False  False
    2  1_0    a   dd    2   True  False
    3    2    a   ee    3  False  False
    4  2_0    a   ff    3   True  False
    5  2_1    a   gg    3   True  False
    6  2_2    a   hh    3   True  False