Search code examples
pandasdatetimedatediff

pandas -- compute min difference between datetimes in two different dataframes


I have 2 data frames consisting of times. I'd like to find the minimum time between ALL df1 times and EACH df2 when df2['Start Time'] - df1['Stop Time'] = dt is positive. For instance:

df1

Stop Time                 Site
 2023-10-17 20:10:00.310      P2
 2023-10-17 21:20:00.440      P1
 2023-10-17 23:30:00.200      P2
 2023-10-18 00:00:00.190      P1
 2023-10-18 01:00:00.130      P1
 2023-10-18 02:00:00.500      P2
 2023-10-18 03:00:00.480      P1
 2023-10-18 04:00:00.020      P2
 2023-10-18 05:00:00.000      P1
 2023-10-18 06:00:00.580      P2

df2

Start Time               Site
2023-10-17 16:00:00.190      SMR
2023-10-17 17:05:00.050      SMR
2023-10-17 19:10:00.550      SMR
2023-10-17 21:40:00.530      SMR
2023-10-17 22:21:00.180      SMR
2023-10-18 05:21:00.090      SMR
2023-10-18 09:15:00.360      SMR
2023-10-18 11:54:00.160      SMR

So for this dataset, the FIRST positive differences are for df2: 2023-10-17 21:40:00.530 and df1: 2023-10-17 20:10:00.310 AND 2023-10-17 21:20:00.440. The minimum that I'd like to keep in a new df_best dataframe is between 2023-10-17 21:40:00.530 - 2023-10-17 21:20:00.440 = 20 min along with the site name P1. So the first entry would be: df_best

diff_min       Site
5            P1

The last d2 entry, 2023-10-18 11:54:00.160, would have a min with the last entry in d1... about 5hrs and 54min.

I can do this with a couple for loops, but I bet there's a cool pandas way to do this quickly.

Thx,


Solution

  • You don't need to find all matches, just the closest one in the wanted direction.

    For that, use a merge_asof:

    df1['Stop Time'] = pd.to_datetime(df1['Stop Time'])
    df2['Start Time'] = pd.to_datetime(df2['Start Time'])
    
    out = (pd
      .merge_asof(df2.sort_values(by='Start Time')
                     .reset_index(),
                  df1.sort_values(by='Stop Time'),
                  left_on='Start Time', right_on='Stop Time',
                  suffixes=(None, '_df1')
                  )
           .set_index('index').reindex(df2.index)
           .assign(diff_min=lambda d: d['Start Time'].sub(d['Stop Time'])
                   .dt.total_seconds().div(60))
    )
    
    print(out)
    

    Output:

                   Start Time Site               Stop Time Site_df1    diff_min
    0 2023-10-17 16:00:00.190  SMR                     NaT      NaN         NaN
    1 2023-10-17 17:05:00.050  SMR                     NaT      NaN         NaN
    2 2023-10-17 19:10:00.550  SMR                     NaT      NaN         NaN
    3 2023-10-17 21:40:00.530  SMR 2023-10-17 21:20:00.440       P1   20.001500
    4 2023-10-17 22:21:00.180  SMR 2023-10-17 21:20:00.440       P1   60.995667
    5 2023-10-18 05:21:00.090  SMR 2023-10-18 05:00:00.000       P1   21.001500
    6 2023-10-18 09:15:00.360  SMR 2023-10-18 06:00:00.580       P2  194.996333
    7 2023-10-18 11:54:00.160  SMR 2023-10-18 06:00:00.580       P2  353.993000
    

    You can further dropna if you're only interested in df2 rows with a match.