I'm familiar with how to drop duplicate rows using drop_duplicates and its keep parameter of 'first', 'last', or False. Nothing too complicated with that, and there are plenty of examples (e.g. here).
However, what I'm looking for is a way to find the duplicates, but instead of dropping all duplicates and keeping the first, keep all the duplicates and drop only the first. So instead of "drop if duplicate, keep the first", I want "keep if duplicate, drop the first".
Example:
Given this dataframe, and looking at duplicates in the cost column:
ID name type cost
0 0 a bb 1
1 1 a cc 2 <--- there are duplicates, so drop this row
2 1_0 a dd 2
3 2 a ee 3 <--- there are duplicates, so drop this row
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
If there are duplicates in the cost column, just drop the first occurrence, but keep the rest.
So my output would be:
ID name type cost
0 0 a bb 1
2 1_0 a dd 2
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
Here's the sample dataframe:

import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])
You can chain 2 masks created by DataFrame.duplicated with bitwise OR and filter by boolean indexing:
df = df[df.duplicated('cost') | ~df.duplicated('cost', keep=False)]
print(df)
ID name type cost
0 0 a bb 1
2 1_0 a dd 2
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
Detail:
print(df.assign(mask1=df.duplicated('cost'), mask2=~df.duplicated('cost', keep=False)))
ID name type cost mask1 mask2
0 0 a bb 1 False True
1 1 a cc 2 False False
2 1_0 a dd 2 True False
3 2 a ee 3 False False
4 2_0 a ff 3 True False
5 2_1 a gg 3 True False
6 2_2 a hh 3 True False
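If you prefer to express the rule positionally as "drop the first row of each duplicated group", an equivalent sketch uses groupby with cumcount and a group-size check (the sizes/out names here are just illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])

# Number of rows sharing each cost value
sizes = df.groupby('cost')['cost'].transform('size')

# Keep a row if it is NOT the first of its group (cumcount > 0),
# or if its cost value is unique (group size == 1)
out = df[(df.groupby('cost').cumcount() > 0) | (sizes == 1)]
print(out)  # IDs 0, 1_0, 2_0, 2_1, 2_2
```

The two boolean conditions mirror mask1 and mask2 from the detail above: cumcount > 0 matches duplicated('cost'), and size == 1 matches ~duplicated('cost', keep=False).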