python pandas dataframe drop-duplicates

Python Pandas: Drop Duplicate Rows Where a Period (.) at the End Is the Only Differentiator


Hi, a section of my pandas DataFrame contains near-duplicate values.

The only difference between them is a period at the end.

Header A
First
First.

I just want to drop the duplicate row that does not end with a period.


Solution

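  • First sort by Header A, so that within each group the value without the . comes before the value with it. Then strip the trailing . and keep only the last row of each group by using Series.duplicated with keep='last':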

    print(df)
      Header A
    0   First.
    1    First
    2   First.
    3  Second.
    4   Second
    5    Third
    6    Third
    
    
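    # "First" sorts before "First.", so the value with the period ends up last in its group
    df1 = df.sort_values('Header A')
    # compare values with the trailing '.' stripped and keep only the last row of each group
    df1 = df1[~df1['Header A'].str.rstrip('.').duplicated(keep='last')]
    print(df1)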
      Header A
    2   First.
    3  Second.
    6    Third
    

    If you need to prioritize the values without the . instead:

    df1 = df.sort_values('Header A')
    # duplicated() defaults to keep='first', so the value without the period is kept
    df2 = df1[~df1['Header A'].str.rstrip('.').duplicated()]
    print(df2)
      Header A
    1    First
    4   Second
    5    Third
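
For anyone who wants to run both variants end to end, here is a minimal, self-contained sketch. The helper name drop_period_duplicates and its prefer_period flag are hypothetical and not part of pandas or of the answer above. It also swaps in a slightly different technique: instead of relying on how "First" and "First." happen to sort as strings, it sorts on the stripped value plus an explicit "ends with a period" flag, then calls DataFrame.drop_duplicates.

```python
import pandas as pd

# Sample data taken from the answer above
df = pd.DataFrame(
    {"Header A": ["First.", "First", "First.", "Second.", "Second", "Third", "Third"]}
)


def drop_period_duplicates(frame, column, prefer_period=True):
    """Hypothetical helper: keep one row per value, ignoring trailing periods.

    prefer_period=True keeps the variant ending in '.', if one exists;
    prefer_period=False keeps the variant without it, if one exists.
    """
    ordered = frame.assign(
        _key=frame[column].str.rstrip("."),     # comparison key without trailing '.'
        _dot=frame[column].str.endswith("."),   # True for rows ending in '.'
    ).sort_values(["_key", "_dot"])             # within a key, rows with '.' sort last
    keep = "last" if prefer_period else "first"
    result = ordered.drop_duplicates("_key", keep=keep)
    # Remove the temporary columns and restore the original row order
    return result.drop(columns=["_key", "_dot"]).sort_index()


with_period = drop_period_duplicates(df, "Header A")
without_period = drop_period_duplicates(df, "Header A", prefer_period=False)
print(with_period)
print(without_period)
```

Note that str.rstrip('.') strips every trailing period, so a value such as "First.." would also be grouped with "First".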