Search code examples
pythonpandasgroup-by

Remove observations within 7 days of each other within specific ID group


I have a pandas dataframe with ID and dates like this:

ID Date
111 16/09/2021
111 14/03/2022
111 18/03/2022
111 21/03/2022
111 22/03/2022
222 27/03/2022
222 30/03/2022
222 4/04/2022
222 6/04/2022
222 13/04/2022

For each ID, I would like to filter the table and remove observations that are within 7 days of each other. But I want to keep the earliest date of the dates that are within 7 days of each other so that each ID will have unique dates that are more than 7 days apart and do not contain other dates in between:

ID Date
111 16/09/2021
111 14/03/2022
111 22/03/2022
222 27/03/2022
222 4/04/2022
222 13/04/2022

I'm quite new to python and pandas dataframe so hoping someone can assist and provide some pointers. There is a similar SO question How do I remove observations within 7 days of each other within a specific ID group? but this was done in R so hoping there is something similar that can be done with Pandas.


Solution

  • Here is a python version using panda which keeps the earliest date and then remove any subsequent dates that fall within a 7-day window of this date:

    import pandas as pd
    
    def filter_dates(group):
        group = group.sort_values(by='Date')
        filtered_dates = []
        last_date = None
        for date in group['Date']:
            if last_date is None or (date - last_date).days > 7:
                filtered_dates.append(date)
                last_date = date
        return group[group['Date'].isin(filtered_dates)]
    
    
    if __name__ == "__main__":
        data = {
            'ID': [111, 111, 111, 111, 111, 222, 222, 222, 222, 222],
            'Date': ['16/09/2021', '14/03/2022', '18/03/2022', '21/03/2022', '22/03/2022',
                     '27/03/2022', '30/03/2022', '04/04/2022', '06/04/2022', '13/04/2022']
        }
        df = pd.DataFrame(data)
        df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
        result_df = df.groupby('ID').apply(filter_dates, include_groups=False).reset_index(drop=True)
        print(result_df)
    

    In this I convert it to a datetime format before grouping them by ID. The filter_dates() function then removes dates that are within 7 days of each other.

    It prints out this:

        ID  level_1       Date
    0  111        0 2021-09-16
    1  111        1 2022-03-14
    2  111        4 2022-03-22
    3  222        5 2022-03-27
    4  222        7 2022-04-04
    5  222        9 2022-04-13