
Pandas: Drop duplicates that appear within a time interval


We have a dataframe with 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicate complaints occurred within 30 days of each other. Please see the example below:

Current Dataset:

   ID        DAY           
0   1  22.03.2020       
1   1  18.04.2020       
2   2  10.05.2020       
3   2  13.01.2020       
4   3  30.03.2020       
5   3  31.03.2020       
6   3  24.02.2021 

Goal:

   ID     DAY           
0   1  22.03.2020       
1   2  10.05.2020       
2   2  13.01.2020       
3   3  30.03.2020       
4   3  24.02.2021      

Any suggestions? I have tried groupby and then a loop to calculate the difference between each pair of dates, but the dataframe has millions of rows, so this would take forever...


Solution

  • You can try grouping by the ID column and taking the diff of the DAY column within each group:

    from datetime import timedelta

    # parse the dd.mm.yyyy strings as datetimes
    df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

    # keep a row only if its gap to the previous row of the same ID
    # exceeds 30 days; the first row of each group has a NaT diff,
    # so ~...le(m) keeps it automatically
    m = timedelta(days=30)

    out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)

    print(out)
    
       ID        DAY
    0   1 2020-03-22
    1   2 2020-05-10
    2   2 2020-01-13
    3   3 2020-03-30
    4   3 2021-02-24
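    Since the dataframe has millions of rows, the same mask can also be built without `groupby.apply`, which avoids the per-group Python call overhead. A minimal self-contained sketch (the example dataframe is rebuilt here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 3],
    'DAY': ['22.03.2020', '18.04.2020', '10.05.2020', '13.01.2020',
            '30.03.2020', '31.03.2020', '24.02.2021'],
})
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

# per-group diff without apply; the first row of each group gets NaT,
# so ~...le(...) keeps it automatically
gap = df.groupby('ID')['DAY'].diff()
out = df[~gap.abs().le(pd.Timedelta(days=30))].reset_index(drop=True)
print(out)
```

    This produces the same five rows as the `apply` version.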
    

    To convert DAY back to the original date format, you can use dt.strftime:

    out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
    
    print(out)
    
       ID         DAY
    0   1  22.03.2020
    1   2  10.05.2020
    2   2  13.01.2020
    3   3  30.03.2020
    4   3  24.02.2021
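
    One caveat: `diff` only compares consecutive rows, so if the dates within a group are not in chronological order, two complaints less than 30 days apart may not be adjacent and would both survive the filter. If that matters for your data, sorting each group by DAY first makes near-duplicates adjacent. A sketch with a hypothetical ID 4 whose close dates are separated by another row:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [4, 4, 4],
    # 01.06 and 10.06 are 9 days apart, but not adjacent in the raw order
    'DAY': ['01.06.2020', '01.01.2021', '10.06.2020'],
})
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

# without sorting, the consecutive diffs are both > 30 days and nothing
# is dropped; sorting within each ID makes near-duplicates adjacent
df = df.sort_values(['ID', 'DAY'])
gap = df.groupby('ID')['DAY'].diff()
out = df[~gap.abs().le(pd.Timedelta(days=30))].reset_index(drop=True)
print(out)  # keeps 01.06.2020 and 01.01.2021; drops 10.06.2020
```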