Hi, I have a column in my pandas DataFrame that contains duplicates, but the difference is minor.
The only differentiator is a period at the end.
Header A |
---|
First |
First. |
I just want to drop the duplicate row that does not have a period.
First sort by Header A, then strip the trailing . and keep the last duplicated value of each group using Series.duplicated with keep='last':
print (df)
Header A
0 First.
1 First
2 First.
3 Second.
4 Second
5 Third
6 Third
df1 = df.sort_values('Header A')
df1 = df1[~df1['Header A'].str.rstrip('.').duplicated(keep='last')]
print (df1)
Header A
2 First.
3 Second.
6 Third
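For a copy-paste run, the steps above can be sketched as below. The sample frame is rebuilt from the printed output, and kind='mergesort' is my addition (an assumption, not part of the original code) so that equal strings keep their original order and the surviving index labels are predictable. The names key and with_period are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the printed frame above
df = pd.DataFrame({'Header A': ['First.', 'First', 'First.',
                                'Second.', 'Second', 'Third', 'Third']})

# 'First' sorts before 'First.', so within each group the value
# with the period ends up last; mergesort keeps ties in original order
df1 = df.sort_values('Header A', kind='mergesort')

# Comparison key: the same strings with trailing periods stripped
key = df1['Header A'].str.rstrip('.')

# duplicated(keep='last') flags every row except the last one of each key,
# so negating the mask keeps exactly that last row
with_period = df1[~key.duplicated(keep='last')]
print(with_period)
```

Note that 'Third' has no version with a period, so one of the plain 'Third' rows is kept instead.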
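If you need to prioritize the values without a trailing . instead, keep the first duplicate of each group (the default for Series.duplicated):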
df1 = df.sort_values('Header A')
df2 = df1[~df1['Header A'].str.rstrip('.').duplicated()]
print (df2)
Header A
1 First
4 Second
5 Third
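If both variants are needed in the same project, they could be wrapped in one small function. This is only a sketch: drop_period_duplicates and its prefer_period flag are hypothetical names I made up for illustration, and the stable sort (kind='mergesort') is again an assumption added so that ties keep their original order.

```python
import pandas as pd

def drop_period_duplicates(df, col, prefer_period=True):
    """Hypothetical helper: drop rows whose `col` value differs
    from another row only by trailing periods."""
    # Stable sort: 'First' lands before 'First.', ties keep original order
    ordered = df.sort_values(col, kind='mergesort')
    # Compare values with trailing periods stripped
    key = ordered[col].str.rstrip('.')
    # The last row of a group has the period (if any), the first does not
    keep = 'last' if prefer_period else 'first'
    return ordered[~key.duplicated(keep=keep)]

df = pd.DataFrame({'Header A': ['First.', 'First', 'First.',
                                'Second.', 'Second', 'Third', 'Third']})

print(drop_period_duplicates(df, 'Header A'))
print(drop_period_duplicates(df, 'Header A', prefer_period=False))
```

The result keeps the sorted order; chaining .sort_index() afterwards should restore the original row order if that matters.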