I have a pandas dataframe with ID and dates like this:
ID | Date |
---|---|
111 | 16/09/2021 |
111 | 14/03/2022 |
111 | 18/03/2022 |
111 | 21/03/2022 |
111 | 22/03/2022 |
222 | 27/03/2022 |
222 | 30/03/2022 |
222 | 4/04/2022 |
222 | 6/04/2022 |
222 | 13/04/2022 |
For each ID, I would like to filter the table and remove observations that are within 7 days of each other. But I want to keep the earliest date of the dates that are within 7 days of each other so that each ID will have unique dates that are more than 7 days apart and do not contain other dates in between:
ID | Date |
---|---|
111 | 16/09/2021 |
111 | 14/03/2022 |
111 | 22/03/2022 |
222 | 27/03/2022 |
222 | 4/04/2022 |
222 | 13/04/2022 |
I'm quite new to python and pandas dataframe so hoping someone can assist and provide some pointers. There is a similar SO question How do I remove observations within 7 days of each other within a specific ID group? but this was done in R so hoping there is something similar that can be done with Pandas.
Here is a python version using panda which keeps the earliest date and then remove any subsequent dates that fall within a 7-day window of this date:
import pandas as pd
def filter_dates(group):
group = group.sort_values(by='Date')
filtered_dates = []
last_date = None
for date in group['Date']:
if last_date is None or (date - last_date).days > 7:
filtered_dates.append(date)
last_date = date
return group[group['Date'].isin(filtered_dates)]
if __name__ == "__main__":
data = {
'ID': [111, 111, 111, 111, 111, 222, 222, 222, 222, 222],
'Date': ['16/09/2021', '14/03/2022', '18/03/2022', '21/03/2022', '22/03/2022',
'27/03/2022', '30/03/2022', '04/04/2022', '06/04/2022', '13/04/2022']
}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
result_df = df.groupby('ID').apply(filter_dates, include_groups=False).reset_index(drop=True)
print(result_df)
In this I convert it to a datetime format before grouping them by ID. The filter_dates() function then removes dates that are within 7 days of each other.
It prints out this:
ID level_1 Date
0 111 0 2021-09-16
1 111 1 2022-03-14
2 111 4 2022-03-22
3 222 5 2022-03-27
4 222 7 2022-04-04
5 222 9 2022-04-13