
Selecting rows in a pandas DataFrame that share the same id and were created within 1 minute of each other


I have a pandas DataFrame that contains multiple rows, each with an identifier and the creation datetime of the row.

id creation_date
MP5694319-001 2024-05-29 12:58:00.489
MP5694319-001 2024-05-29 12:58:03.489
MP5694331-001 2024-05-29 12:58:05.489
MP5694319-001 2024-05-29 13:03:03.489

From this DataFrame I need to keep only the rows that share the same id and were created within 1 minute of each other. The expected result:

id creation_date
MP5694319-001 2024-05-29 12:58:00.489
MP5694319-001 2024-05-29 12:58:03.489

The DataFrame contains many rows with various ids, so I need to group them somehow and then compare the time differences, but I have no idea how to do this.


Solution

  • You can sort with DataFrame.sort_values first, compute the difference within each group in both directions with SeriesGroupBy.diff, and filter with Series.le (less than or equal) in boolean indexing:

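    import pandas as pd

    # Parse the strings into datetimes so that differences are Timedeltas.
    df['creation_date'] = pd.to_datetime(df['creation_date'])

    # Sort so that consecutive rows of one id are neighbours in time.
    df = df.sort_values(['id','creation_date'])

    g = df.groupby('id')['creation_date']
    # Keep rows within 1 minute of the previous or the next row of the same id.
    out = df[g.diff().le('1 min') | g.diff(-1).abs().le('1 min')]
    print(out)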
                  id           creation_date
    0  MP5694319-001 2024-05-29 12:58:00.489
    1  MP5694319-001 2024-05-29 12:58:03.489
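
For readers who want to run the two-direction check end to end, here is a minimal, self-contained sketch. It rebuilds the sample data from the question and wraps the filter in a helper; the name `rows_with_close_neighbour` and the `window` parameter are assumptions made for illustration, not part of pandas or of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": ["MP5694319-001", "MP5694319-001", "MP5694331-001", "MP5694319-001"],
    "creation_date": [
        "2024-05-29 12:58:00.489",
        "2024-05-29 12:58:03.489",
        "2024-05-29 12:58:05.489",
        "2024-05-29 13:03:03.489",
    ],
})
df["creation_date"] = pd.to_datetime(df["creation_date"])


def rows_with_close_neighbour(frame, window=pd.Timedelta(minutes=1)):
    """Keep rows whose previous or next row with the same id is within `window`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    frame = frame.sort_values(["id", "creation_date"])
    g = frame.groupby("id")["creation_date"]
    gap_to_prev = g.diff()           # NaT for the first row of each id
    gap_to_next = g.diff(-1).abs()   # NaT for the last row of each id
    # Comparisons against NaT evaluate to False, so ids with a single row drop out.
    return frame[gap_to_prev.le(window) | gap_to_next.le(window)]


out = rows_with_close_neighbour(df)
print(out)
```

Passing an explicit `pd.Timedelta` instead of the string `'1 min'` is only a stylistic choice here; it makes the threshold easy to change, for example `rows_with_close_neighbour(df, pd.Timedelta(seconds=30))`.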
    

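    Another idea is to back-fill the first (missing) difference in each group so that the first row can be included in the output. Note that this variant checks only the gap to the previous row, so, apart from the first row of a group, a row that is close only to the row after it is not selected: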

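    df['creation_date'] = pd.to_datetime(df['creation_date'])

    df = df.sort_values(['id','creation_date'])

    g = df.groupby('id')['creation_date']
    # diff() leaves NaT in the first row of each group; bfill() copies the
    # second row's gap into it so the first row can pass the filter too.
    out = df[g.transform(lambda x: x.diff().bfill()).le('1 min')]
    print(out)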
                  id           creation_date
    0  MP5694319-001 2024-05-29 12:58:00.489
    1  MP5694319-001 2024-05-29 12:58:03.489
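
Both approaches give the same result on the sample data, but they are not equivalent in general. The sketch below uses made-up data (the id `"A"` and its timestamps are assumptions, not from the question) where the middle row is far from its predecessor but only 3 seconds before its successor, to show how the two filters can differ.

```python
import pandas as pd

# Hypothetical data: the second row is 5 minutes after the first,
# and the third row is only 3 seconds after the second.
df = pd.DataFrame({
    "id": ["A", "A", "A"],
    "creation_date": pd.to_datetime([
        "2024-05-29 12:00:00",
        "2024-05-29 12:05:00",
        "2024-05-29 12:05:03",
    ]),
}).sort_values(["id", "creation_date"])

window = pd.Timedelta(minutes=1)
g = df.groupby("id")["creation_date"]

# Two-direction check: gap to the previous OR to the next row.
both = df[g.diff().le(window) | g.diff(-1).abs().le(window)]

# Back-fill variant: gap to the previous row only, with the first row
# of each group borrowing the gap of the second row.
bfilled = df[g.transform(lambda x: x.diff().bfill()).le(window)]

print(both.index.tolist())     # [1, 2]
print(bfilled.index.tolist())  # [2]
```

If every row that has any neighbour within the window should be kept, the two-direction version is the safer choice of the two; the back-fill version is shorter but only looks backwards.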