python pandas dataframe merge data-munging

Pandas: how to merge two DataFrames based on a timedelta?


I have two dataframes:

df1 = a1 a2 recorded_at
       1. 2. 2020-03-18 00:00:01
       8. 1. 2021-04-15 04:00:10
       9. 0. 2021-03-18 12:40:30

df2 = b1 b2 DateTime
       7. 8. 2020-03-18 00:00:01
       2. 4. 2020-03-18 00:00:04
       2. 6. 2021-04-15 04:00:12
       4. 2. 2021-03-18 12:40:40

I want to merge them by comparing recorded_at to DateTime, keeping every row of df2 whose DateTime falls within 4 seconds after a recorded_at in df1. So I will get:

df_new = a1 a2 recorded_at DateTime b1 b2
       1. 2. 2020-03-18 00:00:01 2020-03-18 00:00:01 7 8 
       1. 2. 2020-03-18 00:00:01 2020-03-18 00:00:04 2 4 
       8. 1. 2021-04-15 04:00:10 2021-04-15 04:00:12 2 6

How can I do it?


Solution

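  • If you don't expect more than one row of df1 to match a single row of df2, an efficient solution is merge_asof, which pairs each row of df2 with at most one row of df1 (the most recent one at or before it). Otherwise you have to compare every row of df1 with every row of df2, which is quadratic and so depends heavily on the size of each input. With merge_asof: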

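    import pandas as pd

    # Both key columns must be datetimes for merge_asof
    df1['recorded_at'] = pd.to_datetime(df1['recorded_at'])
    df2['DateTime'] = pd.to_datetime(df2['DateTime'])

    # For each df2 row, take the latest df1 row recorded at or before it,
    # at most 4 seconds earlier; both inputs must be sorted on their keys
    out = (pd
     .merge_asof(df2.sort_values(by='DateTime'), df1.sort_values(by='recorded_at'),
                 left_on='DateTime', right_on='recorded_at',
                 direction='backward', tolerance=pd.Timedelta('4s')
                 )
     # df2 rows with no match inside the tolerance have missing df1 columns
     .dropna(subset=['recorded_at'])
    )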
    

    output:

        b1   b2            DateTime   a1   a2         recorded_at
    0  7.0  8.0 2020-03-18 00:00:01  1.0  2.0 2020-03-18 00:00:01
    1  2.0  4.0 2020-03-18 00:00:04  1.0  2.0 2020-03-18 00:00:01
    3  2.0  6.0 2021-04-15 04:00:12  8.0  1.0 2021-04-15 04:00:10
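
For the other case mentioned above, where several rows of df1 may match the same row of df2, one possible way to write the quadratic version is a cross merge followed by a filter on the time difference. This is only a sketch, not part of the original answer: it assumes pandas 1.2 or later (for how="cross"), rebuilds the sample frames from the question with integer values, and the names pairs, delta and out_cross are made up for illustration.

```python
import pandas as pd

# Sample data from the question (values assumed to be plain integers here)
df1 = pd.DataFrame({
    "a1": [1, 8, 9],
    "a2": [2, 1, 0],
    "recorded_at": pd.to_datetime([
        "2020-03-18 00:00:01", "2021-04-15 04:00:10", "2021-03-18 12:40:30",
    ]),
})
df2 = pd.DataFrame({
    "b1": [7, 2, 2, 4],
    "b2": [8, 4, 6, 2],
    "DateTime": pd.to_datetime([
        "2020-03-18 00:00:01", "2020-03-18 00:00:04",
        "2021-04-15 04:00:12", "2021-03-18 12:40:40",
    ]),
})

# Pair every df1 row with every df2 row: len(df1) * len(df2) rows in memory
pairs = df1.merge(df2, how="cross")

# Keep only the pairs where DateTime is 0 to 4 seconds after recorded_at
delta = pairs["DateTime"] - pairs["recorded_at"]
mask = delta.between(pd.Timedelta(0), pd.Timedelta("4s"))
out_cross = pairs[mask].reset_index(drop=True)

print(out_cross)
```

On the sample data this keeps the same three pairs as the merge_asof version. The intermediate frame grows with the product of the two lengths, so this only seems reasonable for small inputs.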