Tags: python · pandas · dataframe · optimization · vectorization

How to speed up joining large datasets on non-equal time values


I have two large DataFrames, df1 and df2, both of which have a time column. I'd like to join these two tables, but there may not always be an exact match on time. In that case, I want each row of df1 to pick up the df2 row with the latest time at or before the df1 time, and I need this to be efficient.

For example, given the tables

df1
Time             | val_1
------------------------
1/1/1980 1:00:00 | 1
1/1/1980 2:00:00 | 2
1/1/1980 3:00:00 | 3
1/1/1980 4:00:00 | 4

df2
time             | val_2
------------------------
1/1/1980 1:00:00 | 5
1/1/1980 1:59:59 | 6
1/1/1980 3:00:01 | 7
1/1/1980 3:30:30 | 8

The final table should be

time             | val_1 | val_2
--------------------------------
1/1/1980 1:00:00 | 1     | 5
1/1/1980 2:00:00 | 2     | 6
1/1/1980 3:00:00 | 3     | 6
1/1/1980 4:00:00 | 4     | 8

I'm currently doing this, but the runtime is too high:

import numpy as np

def prevrow(t):
    # for each df1 time t, scan all of df2 to find the latest time <= t
    return df2.iloc[df2['time'].apply(lambda x: t - x if t >= x else np.nan).idxmin()]

pd.concat([df1, df1['Time'].apply(prevrow)], axis=1)

How would I speed this up?


Solution

  • We can use merge_asof instead:

    # df1 = df1.rename(columns={'Time': 'time'})  # needed if the column is still named 'Time'
    new_df = pd.merge_asof(df1, df2, on='time', direction='backward')
    

    Note: direction='backward' is the default, so it does not need to be specified explicitly; however, it is the direction of matching we want here — each row of df1 takes the last row of df2 whose time is at or before it.

    new_df:

                     time  val_1  val_2
    0 1980-01-01 01:00:00      1      5
    1 1980-01-01 02:00:00      2      6
    2 1980-01-01 03:00:00      3      6
    3 1980-01-01 04:00:00      4      8
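    (Going slightly beyond the original answer: merge_asof also supports direction='forward' and direction='nearest', which can be handy if the matching rule ever changes. A quick sketch with the same sample frames:)

    ```python
    import pandas as pd

    df1 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 2:00:00',
                                '1/1/1980 3:00:00', '1/1/1980 4:00:00']),
        'val_1': [1, 2, 3, 4]
    })
    df2 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 1:59:59',
                                '1/1/1980 3:00:01', '1/1/1980 3:30:30']),
        'val_2': [5, 6, 7, 8]
    })

    # 'forward' takes the first df2 row at or after each df1 time
    # (rows with no later df2 time get NaN); 'nearest' takes whichever
    # df2 row is closest in either direction
    fwd = pd.merge_asof(df1, df2, on='time', direction='forward')
    near = pd.merge_asof(df1, df2, on='time', direction='nearest')
    ```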
    

    The caveat is that the time column must be sorted in both DataFrames; if it is not already, this can be accomplished with sort_values:

    # df1 = df1.rename(columns={'Time': 'time'})
    new_df = pd.merge_asof(df1.sort_values('time'),
                           df2.sort_values('time'), 
                           on='time')
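    (Also not in the original answer, but worth knowing on real data: merge_asof accepts a tolerance, so stale matches beyond a cutoff become NaN instead of silently joining. A sketch with the same sample frames:)

    ```python
    import pandas as pd

    df1 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 2:00:00',
                                '1/1/1980 3:00:00', '1/1/1980 4:00:00']),
        'val_1': [1, 2, 3, 4]
    })
    df2 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 1:59:59',
                                '1/1/1980 3:00:01', '1/1/1980 3:30:30']),
        'val_2': [5, 6, 7, 8]
    })

    # only accept a backward match within one hour; the 3:00:00 row's
    # nearest predecessor (1:59:59) is just over an hour away, so it gets NaN
    limited = pd.merge_asof(df1, df2, on='time', tolerance=pd.Timedelta('1h'))
    ```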
    

    Some timing information via %timeit:

    Original Approach:

    def prevrow(t):
        return df2.iloc[df2['time'].apply(lambda x: t - x if t >= x else np.nan).idxmin()]
    
    %timeit pd.concat([df1,df1['time'].apply(prevrow)], axis=1)
    2.29 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    merge_asof without sorting:

    %timeit pd.merge_asof(df1, df2, on='time')
    1.13 ms ± 50.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    merge_asof with sorting:

    %timeit pd.merge_asof(df1.sort_values('time'), df2.sort_values('time'), on='time')
    1.46 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Data and imports:

    import numpy as np
    import pandas as pd
    
    df1 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 2:00:00',
                                '1/1/1980 3:00:00', '1/1/1980 4:00:00']),
        'val_1': [1, 2, 3, 4]
    })
    
    df2 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 1:59:59',
                                '1/1/1980 3:00:01', '1/1/1980 3:30:30']),
        'val_2': [5, 6, 7, 8]
    })
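
    (As a final aside, beyond the original answer: the same backward lookup can be written directly with numpy.searchsorted. This sketch assumes df2 is sorted by time and that every df1 time has at least one df2 time at or before it, so no index goes negative:)

    ```python
    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 2:00:00',
                                '1/1/1980 3:00:00', '1/1/1980 4:00:00']),
        'val_1': [1, 2, 3, 4]
    })
    df2 = pd.DataFrame({
        'time': pd.to_datetime(['1/1/1980 1:00:00', '1/1/1980 1:59:59',
                                '1/1/1980 3:00:01', '1/1/1980 3:30:30']),
        'val_2': [5, 6, 7, 8]
    })

    # index of the last df2 time <= each df1 time (requires df2 sorted by time)
    idx = np.searchsorted(df2['time'].to_numpy(), df1['time'].to_numpy(),
                          side='right') - 1

    out = df1.copy()
    # assumes idx >= 0 everywhere, i.e. no df1 time precedes all of df2
    out['val_2'] = df2['val_2'].to_numpy()[idx]
    ```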