I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10], 'name': ['mary','mary','mary','tom','tom','john','sarah','tom','tom','tom'], 'age': [30,30,30,25,25,28,36,25,25,25]})
   id   name  age
0   1   mary   30
1   2   mary   30
2   3   mary   30
3   4    tom   25
4   5    tom   25
5   6   john   28
6   7  sarah   36
7   8    tom   25
8   9    tom   25
9  10    tom   25
There are multiple duplicated rows in it (not considering the 'id' column).
I want to remove only the duplicated rows at the tail of the DataFrame, keeping the first row of that trailing run (again, not considering the 'id' column).
I expect to get this:
   id   name  age
0   1   mary   30
1   2   mary   30
2   3   mary   30
3   4    tom   25
4   5    tom   25
5   6   john   28
6   7  sarah   36
7   8    tom   25
I couldn't find a way to solve this with drop_duplicates.
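For reference, df.drop_duplicates(subset=['name', 'age']) keeps only the first occurrence of each name/age pair, so it also drops row 7, which I want to keep:

df.drop_duplicates(subset=['name', 'age'])

   id   name  age
0   1   mary   30
3   4    tom   25
5   6   john   28
6   7  sarah   36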
Code
cols = ['name', 'age']  # columns to check for duplicates
grp = df[cols].ne(df[cols].shift()).any(axis=1).cumsum()  # label each run of consecutive duplicate rows
cond = grp.shift().ne(grp.max())  # False once the previous row already belongs to the last run
out = df[cond]
out
   id   name  age
0   1   mary   30
1   2   mary   30
2   3   mary   30
3   4    tom   25
4   5    tom   25
5   6   john   28
6   7  sarah   36
7   8    tom   25
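If you need this for other column subsets, the same idea can be wrapped in a small helper (a sketch; the function name is mine):

def drop_tail_duplicates(frame, subset):
    # label each run of consecutive duplicate rows, then keep everything up to
    # and including the first row of the last run
    grp = frame[subset].ne(frame[subset].shift()).any(axis=1).cumsum()
    return frame[grp.shift().ne(grp.max())]

drop_tail_duplicates(df, ['name', 'age'])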
Intermediate
   id   name  age  grp  grp.shift()  cond (= grp.shift() != grp.max() = 5)
0   1   mary   30    1          NaN  True
1   2   mary   30    1          1.0  True
2   3   mary   30    1          1.0  True
3   4    tom   25    2          1.0  True
4   5    tom   25    2          2.0  True
5   6   john   28    3          2.0  True
6   7  sarah   36    4          3.0  True
7   8    tom   25    5          4.0  True
8   9    tom   25    5          5.0  False
9  10    tom   25    5          5.0  False
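To reproduce this intermediate view yourself (assuming df, grp and cond from the Code section are still in scope):

df.assign(grp=grp, grp_shift=grp.shift(), cond=cond)  # grp_shift stands in for grp.shift(), since assign keywords can't contain dots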