I have two dataframes:
df1 = a1 a2 recorded_at
1. 2. 2020-03-18 00:00:01
8. 1. 2021-04-15 04:00:10
9. 0. 2021-03-18 12:40:30
df2 = b1 b2 DateTime
7. 8. 2020-03-18 00:00:01
2. 4. 2020-03-18 00:00:04
2. 6. 2021-04-15 04:00:12
4. 2. 2021-03-18 12:40:40
I want to merge them by comparing recorded_at to DateTime, taking all df2 rows whose DateTime falls within 4 seconds after recorded_at.
So I will get:
df_new = a1 a2 recorded_at DateTime b1 b2
1. 2. 2020-03-18 00:00:01 2020-03-18 00:00:01 7 8
1. 2. 2020-03-18 00:00:01 2020-03-18 00:00:04 2 4
8. 1. 2021-04-15 04:00:10 2021-04-15 04:00:12 2 6
How can I do it?
If you don't expect more than one row of df1 to match a single row of df2, then an efficient solution is a merge_asof. Otherwise, the merge computation will be quadratic, so greatly dependent on the size of each input. With merge_asof:
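import pandas as pd

# merge_asof needs real datetimes, sorted on the join keys
df1['recorded_at'] = pd.to_datetime(df1['recorded_at'])
df2['DateTime'] = pd.to_datetime(df2['DateTime'])

# for each df2 row, take the latest df1 row that is at most 4s earlier
out = (pd
   .merge_asof(df2.sort_values(by='DateTime'), df1.sort_values(by='recorded_at'),
               left_on='DateTime', right_on='recorded_at',
               direction='backward', tolerance=pd.Timedelta('4s')
               )
   .dropna(subset=['recorded_at'])  # drop df2 rows that found no match
 )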
output:
b1 b2 DateTime a1 a2 recorded_at
0 7.0 8.0 2020-03-18 00:00:01 1.0 2.0 2020-03-18 00:00:01
1 2.0 4.0 2020-03-18 00:00:04 1.0 2.0 2020-03-18 00:00:01
3 2.0 6.0 2021-04-15 04:00:12 8.0 1.0 2021-04-15 04:00:10
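For the other case, where several rows of df1 may match the same row of df2, here is a minimal sketch of the quadratic approach: a cross join followed by a filter on the time difference. It assumes pandas 1.2 or later (for how="cross") and rebuilds the sample frames from the question; the names pairs and delta are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "a1": [1.0, 8.0, 9.0],
    "a2": [2.0, 1.0, 0.0],
    "recorded_at": pd.to_datetime([
        "2020-03-18 00:00:01", "2021-04-15 04:00:10", "2021-03-18 12:40:30"]),
})
df2 = pd.DataFrame({
    "b1": [7.0, 2.0, 2.0, 4.0],
    "b2": [8.0, 4.0, 6.0, 2.0],
    "DateTime": pd.to_datetime([
        "2020-03-18 00:00:01", "2020-03-18 00:00:04",
        "2021-04-15 04:00:12", "2021-03-18 12:40:40"]),
})

# Cross join: every df1 row paired with every df2 row,
# so the intermediate frame has len(df1) * len(df2) rows
pairs = df1.merge(df2, how="cross")

# Keep the pairs where DateTime is 0 to 4 seconds (inclusive) after recorded_at
delta = pairs["DateTime"] - pairs["recorded_at"]
df_new = pairs[delta.between(pd.Timedelta(0), pd.Timedelta("4s"))].reset_index(drop=True)

print(df_new)
```

On the sample data this keeps the same three pairs as the merge_asof version. The intermediate frame grows with the product of the two lengths, so this is only practical for small inputs.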