pandas duplicates pandas-groupby data-cleaning python-datetime

Pandas: Drop duplicates that appear within a time interval

We have a dataframe with 'ID' and 'DAY' columns that records when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only when the duplicates occurred at most 30 days apart. Please see the example below:

Current Dataset:

   ID        DAY           
0   1  22.03.2020       
1   1  18.04.2020       
2   2  10.05.2020       
3   2  13.01.2020       
4   3  30.03.2020       
5   3  31.03.2020       
6   3  24.02.2021

Goal:

   ID        DAY
0   1  22.03.2020       
1   2  10.05.2020       
2   2  13.01.2020       
3   3  30.03.2020       
4   3  24.02.2021

Any suggestions? I have tried groupby followed by a loop that calculates the difference between each combination, but because the dataframe has millions of rows this would take forever...

Solution

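You can group by the ID column and diff the DAY column within each group. A row is dropped when it falls within 30 days of the previous row for the same ID. The first row of each group has no previous row (its diff is NaT), so it is always kept.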

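import pandas as pd
from datetime import timedelta

# Parse day-first strings such as 22.03.2020 into datetimes
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)

m = timedelta(days=30)

# Within each ID, keep rows that are not within 30 days of the previous row
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)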

print(out)

   ID        DAY
0   1 2020-03-22
1   2 2020-05-10
2   2 2020-01-13
3   3 2020-03-30
4   3 2021-02-24
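
Since the question mentions millions of rows, here is a minimal sketch of the same idea without the per-group lambda. It assumes the column names from the question; the names gap, keep and out_fast are only illustrative. Calling diff directly on the grouped column returns one Series aligned with the original index, which is then used as a single boolean mask. This may be faster than apply on large frames, but that is not measured here.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3, 3],
    "DAY": ["22.03.2020", "18.04.2020", "10.05.2020", "13.01.2020",
            "30.03.2020", "31.03.2020", "24.02.2021"],
})
df["DAY"] = pd.to_datetime(df["DAY"], format="%d.%m.%Y")

# Gap to the previous row of the same ID (NaT for the first row of each ID)
gap = df.groupby("ID")["DAY"].diff().abs()

# NaT compares as False, so the first row of every ID is kept
keep = ~gap.le(pd.Timedelta(days=30))
out_fast = df[keep].reset_index(drop=True)
print(out_fast)
```

Like the apply version, this compares each row only with the row immediately before it for the same ID, in the existing row order. If the rows are not in chronological order within each ID, sorting with df.sort_values(['ID', 'DAY']) first may be needed, depending on what should count as a duplicate.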

To convert back to the original date format, you can use dt.strftime:

out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')

print(out)

   ID         DAY
0   1  22.03.2020
1   2  10.05.2020
2   2  13.01.2020
3   3  30.03.2020
4   3  24.02.2021