Tags: python, pandas, dataframe, duplicates, comparison

How to find duplicate rows in a dataframe within a given tolerance, based on a single/multiple column value?


Here is a suggested solution for listing duplicate values by comparing two dataframes based on a column value. How can I do the same thing to find the duplicate rows within a single dataframe, given a tolerance, and list them in a new dataframe?

df
         YearDeci  Year  Month  Day  Hour  ...  Seconds   Mb     Lat     Lon  
0     1669.510  1669      6    4     0  ...        0  NaN  33.400  73.200    
1     1720.535  1720      7   15     0  ...        0  NaN  28.700  77.200    
2     1780.000  1780      0    0     0  ...        0  NaN  35.000  77.000    
3     1669.511  1669      6    4     0  ...        0  NaN  33.400  73.200    
4     1803.665  1803      9    1     0  ...        0  NaN  30.300  78.800
5     1803.388  1803      5   22    15  ...        0  NaN  30.600  78.600
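For reference, here is a minimal reconstruction of the sample dataframe, covering only the columns visible in the truncated printout (the columns hidden behind `...` are left out):

    import numpy as np
    import pandas as pd

    # Sample data as shown above; the '...' columns are omitted.
    df = pd.DataFrame({
        'YearDeci': [1669.510, 1720.535, 1780.000, 1669.511, 1803.665, 1803.388],
        'Year':     [1669, 1720, 1780, 1669, 1803, 1803],
        'Month':    [6, 7, 0, 6, 9, 5],
        'Day':      [4, 15, 0, 4, 1, 22],
        'Hour':     [0, 0, 0, 0, 0, 15],
        'Seconds':  [0, 0, 0, 0, 0, 0],
        'Mb':       [np.nan] * 6,
        'Lat':      [33.400, 28.700, 35.000, 33.400, 30.300, 30.600],
        'Lon':      [73.200, 77.200, 77.000, 73.200, 78.800, 78.600],
    })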

Suppose we are looking for duplicate rows based on YearDeci, with a tolerance of 0.002.

Expected output:

0  1669.510  1669      6    4     0  ...        0  NaN  33.400  73.200
3  1669.511  1669      6    4     0  ...        0  NaN  33.400  73.200

Or just the index pairs:

index1 index2
0      3

Solution

    1. First, sort the dataframe by the column in which you are looking for rows that are within 0.002 of each other.
    2. Then use .diff() to find the difference between consecutive rows and keep the rows where that difference is < 0.002. To also capture the first row of each pair (the row above), combine the mask with .shift(-1):

    df = df.sort_values('YearDeci')
    # True where a row is within 0.002 of the previous (sorted) row
    s = df['YearDeci'].diff() < 0.002
    # s marks the second row of each pair; s.shift(-1) marks the first.
    # fill_value=False keeps the mask boolean at the edges.
    df[s | s.shift(-1, fill_value=False)]
    Out[1]: 
       YearDeci  Year  Month  Day  Hour  ...  Seconds  Mb   Lat     Lon
    0  1669.510  1669      6    4     0  ...        0 NaN  33.4  73.200
    3  1669.511  1669      6    4     0  ...        0 NaN  33.4  73.200
    

    To get the index values:

    df[s | s.shift(-1, fill_value=False)].index.tolist()
    # OR: [*df[s | s.shift(-1, fill_value=False)].index]
    
    [0, 3]
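
    To get the output as (index1, index2) pairs, as asked in the question, here is a minimal sketch building on the same masks. Note the assumptions: near-duplicates must end up adjacent after sorting, and the zip pairing only covers consecutive pairs:

    sorted_df = df.sort_values('YearDeci')
    close = sorted_df['YearDeci'].diff() < 0.002    # second row of each pair
    first = close.shift(-1, fill_value=False)       # first row of each pair
    # Pair up the original index labels of each adjacent near-duplicate pair
    pairs = list(zip(sorted_df.index[first], sorted_df.index[close]))
    pairs

    [(0, 3)]

    The question title also mentions multiple columns. A hedged extension is to AND together one closeness mask per column; the Lat/Lon tolerances below are illustrative, not from the question, and pairs are still only detected when they sort next to each other on YearDeci:

    close = ((sorted_df['YearDeci'].diff().abs() < 0.002)
             & (sorted_df['Lat'].diff().abs() < 0.01)
             & (sorted_df['Lon'].diff().abs() < 0.01))
    sorted_df[close | close.shift(-1, fill_value=False)]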