Search code examples
pythonpandasdataframedrop-duplicates

How to keep only first duplicated row from Pandas DataFrame tail?


I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10], 'name': ['mary','mary','mary','tom','tom','john','sarah','tom','tom','tom'], 'age': [30,30,30,25,25,28,36,25,25,25]})

    id    name  age
0    1    mary   30
1    2    mary   30
2    3    mary   30
3    4     tom   25
4    5     tom   25
5    6    john   28
6    7   sarah   36
7    8     tom   25
8    9     tom   25
9   10     tom   25

There are multiple duplicated rows on it (not considering 'id' column).

I want to remove only the tail duplicated rows keeping the first one (not considering 'id' column).

I expect to get this:

    id    name  age
0    1    mary   30
1    2    mary   30
2    3    mary   30
3    4     tom   25
4    5     tom   25
5    6    john   28
6    7   sarah   36
7    8     tom   25

I couldn't find a way to resolve it with drop_duplicates.


Solution

  • Code

    cols = ['name', 'age'] # column to chk duplicate
    grp = df[cols].ne(df[cols].shift()).any(axis=1).cumsum() # make duplicate group
    cond = grp.shift().ne(grp.max())
    out = df[cond]
    

    out

       id   name  age
    0   1   mary   30
    1   2   mary   30
    2   3   mary   30
    3   4    tom   25
    4   5    tom   25
    5   6   john   28
    6   7  sarah   36
    7   8    tom   25
    

    Intermediate

    df                    grp   grp.shift()   cond <- grp.shift() != 5
       id   name  age    
    0   1   mary   30      1       NaN        True
    1   2   mary   30      1       1.0        True
    2   3   mary   30      1       1.0        True
    3   4    tom   25      2       1.0        True
    4   5    tom   25      2       2.0        True
    5   6   john   28      3       2.0        True
    6   7  sarah   36      4       3.0        True
    7   8    tom   25      5       4.0        True
    8   9    tom   25      5       5.0        False
    9  10    tom   25      5       5.0        False