I have a pandas DataFrame that contains multiple rows, each with an identifier and the datetime at which the row was created.
id | creation_date |
---|---|
MP5694319-001 | 2024-05-29 12:58:00.489 |
MP5694319-001 | 2024-05-29 12:58:03.489 |
MP5694331-001 | 2024-05-29 12:58:05.489 |
MP5694319-001 | 2024-05-29 13:03:03.489 |
From this DataFrame I need to keep only the rows that share the same id and were created within 1 minute of each other. The expected result:
id | creation_date |
---|---|
MP5694319-001 | 2024-05-29 12:58:00.489 |
MP5694319-001 | 2024-05-29 12:58:03.489 |
The DataFrame contains many rows with various ids, so I need to group them somehow and then compute the time difference, but I have no idea how to do this.
You can use DataFrame.sort_values first, then get the difference per group in both directions with DataFrameGroupBy.diff, and filter with Series.le (less than or equal) in boolean indexing:
import pandas as pd
df['creation_date'] = pd.to_datetime(df['creation_date'])
df = df.sort_values(['id','creation_date'])  # order rows by time within each id
g = df.groupby('id')['creation_date']
out = df[g.diff().le('1 min') | g.diff(-1).abs().le('1 min')]  # close to previous OR next row
print(out)
id creation_date
0 MP5694319-001 2024-05-29 12:58:00.489
1 MP5694319-001 2024-05-29 12:58:03.489
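For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the sample data from the question and applies the same two-direction check. The names `limit`, `gap_prev` and `gap_next` are only illustrative and are not part of the original snippet; `pd.Timedelta(minutes=1)` is used in place of the `'1 min'` string purely for explicitness.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": ["MP5694319-001", "MP5694319-001", "MP5694331-001", "MP5694319-001"],
    "creation_date": [
        "2024-05-29 12:58:00.489",
        "2024-05-29 12:58:03.489",
        "2024-05-29 12:58:05.489",
        "2024-05-29 13:03:03.489",
    ],
})
df["creation_date"] = pd.to_datetime(df["creation_date"])
df = df.sort_values(["id", "creation_date"])

g = df.groupby("id")["creation_date"]
limit = pd.Timedelta(minutes=1)  # illustrative name for the threshold

# Gap to the previous row of the same id (NaT for the first row of a group).
gap_prev = g.diff()
# Gap to the next row of the same id (NaT for the last row of a group).
gap_next = g.diff(-1).abs()

# Comparisons against NaT evaluate to False, so ids with a single row are dropped.
out = df[gap_prev.le(limit) | gap_next.le(limit)]
print(out)
```

With the question's data this keeps the rows with index 0 and 1, matching the output shown above.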
Another idea is to back-fill the first difference in each group so that the group's first row can be included in the output:
df['creation_date'] = pd.to_datetime(df['creation_date'])
df = df.sort_values(['id','creation_date'])
g = df.groupby('id')['creation_date']
out = df[g.transform(lambda x: x.diff().bfill()).le('1 min')]
print(out)
id creation_date
0 MP5694319-001 2024-05-29 12:58:00.489
1 MP5694319-001 2024-05-29 12:58:03.489
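One thing worth checking before choosing between the two variants: they give the same result on the question's data, but they do not appear to be equivalent in general. The sketch below uses hypothetical data (a single id `"A"` with a 10-minute gap followed by a 3-second gap) to compare them; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: one id, a 10-minute gap, then a 3-second gap.
df = pd.DataFrame({
    "id": ["A", "A", "A"],
    "creation_date": pd.to_datetime([
        "2024-05-29 12:00:00",
        "2024-05-29 12:10:00",
        "2024-05-29 12:10:03",
    ]),
})
df = df.sort_values(["id", "creation_date"])
g = df.groupby("id")["creation_date"]
limit = pd.Timedelta(minutes=1)

# Variant 1: a row is kept if it is close to its previous OR its next row.
both_directions = df[g.diff().le(limit) | g.diff(-1).abs().le(limit)]

# Variant 2: only the gap to the previous row is checked; the first row of
# each group borrows the gap of the second row via bfill.
backfilled = df[g.transform(lambda x: x.diff().bfill()).le(limit)]

print(both_directions.index.tolist())
print(backfilled.index.tolist())
```

Here the two-direction variant keeps rows 1 and 2, while the back-fill variant keeps only row 2: row 1 is within a minute of the row after it, but not of the row before it, and back-filling only helps the first row of each group. If every row of a close pair must be returned, the two-direction variant looks like the safer choice.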