Search code examples
pythonpandasdataframemergepython-datetime

Pandas merge multiple dataframes on one temporal index, with latest value from all others


I'm merging some dataframes which have a time index.

import pandas as pd
df1 = pd.DataFrame(['a', 'b', 'c'],
    columns=pd.MultiIndex.from_product([['target'], ['key']]),
    index = [
        '2022-04-15 20:20:20.000000', 
        '2022-04-15 20:20:21.000000', 
        '2022-04-15 20:20:22.000000'],)
df2 = pd.DataFrame(['a2', 'b2', 'c2', 'd2', 'e2'],
    columns=pd.MultiIndex.from_product([['feature2'], ['keys']]),
    index = [
        '2022-04-15 20:20:20.100000', 
        '2022-04-15 20:20:20.500000', 
        '2022-04-15 20:20:20.900000', 
        '2022-04-15 20:20:21.000000', 
        '2022-04-15 20:20:21.100000',],)
df3 = pd.DataFrame(['a3', 'b3', 'c3', 'd3', 'e3'],
    columns=pd.MultiIndex.from_product([['feature3'], ['keys']]),
    index = [
        '2022-04-15 20:20:19.000000', 
        '2022-04-15 20:20:19.200000', 
        '2022-04-15 20:20:20.000000', 
        '2022-04-15 20:20:20.200000', 
        '2022-04-15 20:20:23.100000',],)

then I use this merge procedure:

def merge(dfs:list[pd.DataFrame], targetColumn:'str|tuple[str]'):
    from functools import reduce
    if len(dfs) == 0:
        return None
    if len(dfs) == 1:
        return dfs[0]
    for df in dfs:
        df.index = pd.to_datetime(df.index)
    merged = reduce(
        lambda left, right: pd.merge(
            left, 
            right, 
            how='outer',
            left_index=True,
            right_index=True),
        dfs)
    for col in merged.columns:
        if col != targetColumn:
            merged[col] = merged[col].fillna(method='ffill')
    return merged[merged[targetColumn].notna()]

like this:

merged = merge([df1, df2, df3], targetColumn=('target', 'key'))

which produces this:

target production

And it all works great. Problem is efficiency - notice in the merge() procedure I use reduce and an outer merge to join the dataframes together, this can make a HUGE interim dataframe which then gets filtered down. But what if my pc doesn't have enough ram to handle that huge dataframe in memory? well that's the problem I'm trying to avoid.

I'm wondering if there's a way to avoid expanding the data out into a huge dataframe while merging.

Of course a regular old merge isn't sufficient because it only merges on exactly matching indexes rather than the latest temporal index before the target variable's observation:

df1.merge(df2, how='left', left_index=True, right_index=True)

efficient but bad merge

Has this kind of thing been solved efficiently? Seems like a common data science issue, since no one wants to leak future information into their models, and everyone has various inputs to merge together...


Solution

  • You're in luck: pandas.merge_asof does exactly what you need!

    We use the default direction='backward' argument:

    A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

    Using your three example DataFrames:

    import pandas as pd
    from functools import reduce
    
    # Convert all indexes to datetime
    for df in [df1, df2, df3]:
        df.index = pd.to_datetime(df.index)
    
    # Perform as-of merges
    res = reduce(lambda left, right:
                 pd.merge_asof(left, right, left_index=True, right_index=True),
                 [df1, df2, df3])
    
    print(res)
    
                        target feature2 feature3
                           key     keys     keys
    2022-04-15 20:20:20      a      NaN       c3
    2022-04-15 20:20:21      b       d2       d3
    2022-04-15 20:20:22      c       e2       d3