Here is a suggested solution for listing duplicate values by comparing two dataframes on a column value. How can I do the same thing within a single dataframe, i.e. find the rows that duplicate each other within a given tolerance, and list them in a new dataframe?
df
YearDeci Year Month Day Hour ... Seconds Mb Lat Lon
0 1669.510 1669 6 4 0 ... 0 NaN 33.400 73.200
1 1720.535 1720 7 15 0 ... 0 NaN 28.700 77.200
2 1780.000 1780 0 0 0 ... 0 NaN 35.000 77.000
3 1669.511 1669 6 4 0 ... 0 NaN 33.400 73.200
4 1803.665 1803 9 1 0 ... 0 NaN 30.300 78.800
5 1803.388 1803 5 22 15 ... 0 NaN 30.600 78.600
If we are looking for duplicate rows based on YearDeci with a tolerance of 0.002, the expected output is:
1669.510 1669 6 4 0 ... 0 NaN 33.400 73.200 3
1669.511 1669 6 4 0 ... 0 NaN 33.400 73.200
Or just the index pairs:
index1 index2
0 3
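For reproducibility, the frame above can be set up like this (a minimal sketch keeping only the YearDeci column the tolerance check uses plus Lat/Lon; the omitted columns are not needed for the answer):

```python
import pandas as pd

# Minimal reproduction of the sample dataframe from the question.
df = pd.DataFrame({
    "YearDeci": [1669.510, 1720.535, 1780.000,
                 1669.511, 1803.665, 1803.388],
    "Lat": [33.4, 28.7, 35.0, 33.4, 30.3, 30.6],
    "Lon": [73.2, 77.2, 77.0, 73.2, 78.8, 78.6],
})
```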
Since duplicate rows must be within 0.002 of each other, sort by YearDeci and use .diff() to find the difference between consecutive rows, filtering where .diff() is < 0.002. You also have to pick up the row above each match with .shift(-1):
df = df.sort_values('YearDeci')
s = df['YearDeci'].diff() < 0.002
df[s | s.shift(-1)]
Out[1]:
YearDeci Year Month Day Hour ... Seconds Mb Lat Lon
0 1669.510 1669 6 4 0 ... 0 NaN 33.4 73.200
3 1669.511 1669 6 4 0 ... 0 NaN 33.4 73.200
To get the index values:
df[(s | s.shift(-1))].index.tolist()
# OR [*df[(s | s.shift(-1))].index]
[0, 3]
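If the desired output is the index pairs themselves (the index1/index2 form shown in the question), one sketch is to walk the sorted values and pair up consecutive entries that fall within the tolerance. Column name YearDeci and the 0.002 tolerance are taken from the question; note this only pairs adjacent values after sorting, which is enough here since any two values within the tolerance end up next to each other once sorted:

```python
import pandas as pd

df = pd.DataFrame({
    "YearDeci": [1669.510, 1720.535, 1780.000,
                 1669.511, 1803.665, 1803.388],
})

# Sort the column but keep the original index, so pairs are
# reported against the original row labels.
s = df["YearDeci"].sort_values()
vals = s.to_numpy()
idx = s.index.to_numpy()

pairs = []
for i in range(len(vals) - 1):
    # Consecutive sorted values within the tolerance are duplicates.
    if vals[i + 1] - vals[i] < 0.002:
        pairs.append((int(idx[i]), int(idx[i + 1])))

print(pairs)  # [(0, 3)]
```

This returns the pairs directly instead of a filtered dataframe, which matches the "index1 index2" form asked for in the question.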