
Pandas left join between datetimes


I have two DataFrames, df and gdf:

from datetime import datetime
import pandas as pd

data = [['foo', datetime(2020,1,1,0,0,0) ], ['foo', datetime(2020,2,1,0,0,0)], ['foo', datetime(2020,3,1,0,0,0)],
       ['bar', datetime(2020,4,1,0,0,0)],['bar', datetime(2020,5,1,0,0,0)],['bar', datetime(2020,6,1,0,0,0)]]
df = pd.DataFrame(data, columns = ['id', 'timestamp'])

data = [['A', datetime(2020,1,15,0,0,0), datetime(2020,3,15,0,0,0) ], ['B', datetime(2020,4,15,0,0,0),datetime(2020,6,15,0,0,0)]]
gdf = pd.DataFrame(data, columns = ['geoid', 'starttime', 'endtime'])


df
    id  timestamp
0   foo 2020-01-01
1   foo 2020-02-01
2   foo 2020-03-01
3   bar 2020-04-01
4   bar 2020-05-01
5   bar 2020-06-01

gdf
    geoid starttime     endtime
0   A     2020-01-15    2020-03-15
1   B     2020-04-15    2020-06-15

My goal is to left join gdf on df where timestamp is between the starttime and endtime so that the output looks like:

res
    id  timestamp   geoid
0   foo 2020-01-01  None
1   foo 2020-02-01  A
2   foo 2020-03-01  A
3   bar 2020-04-01  None
4   bar 2020-05-01  B
5   bar 2020-06-01  B

As far as I can tell, the only temporal join method in pandas is pandas.merge_asof(), which does not fit this use case because the goal is to match timestamps that fall between two bounds rather than to match the closest one.
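To make the mismatch concrete, here is a small sketch of what a plain merge_asof on starttime does with the sample data. It is only an illustration under the assumption that the default backward direction is used; the variable name `out` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "timestamp": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                                 "2020-04-01", "2020-05-01", "2020-06-01"]),
})
gdf = pd.DataFrame({
    "geoid": ["A", "B"],
    "starttime": pd.to_datetime(["2020-01-15", "2020-04-15"]),
    "endtime": pd.to_datetime(["2020-03-15", "2020-06-15"]),
})

# merge_asof pairs each timestamp with the most recent starttime at or before it;
# it never looks at endtime, so a timestamp past the end of an interval still matches.
out = pd.merge_asof(df, gdf, left_on="timestamp", right_on="starttime")
print(out[["id", "timestamp", "geoid", "endtime"]])
```

The row for 2020-04-01 is paired with geoid A even though A ended on 2020-03-15, which is why the nearest-key match alone does not give the desired result.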

What is the correct method in pandas (without using SQLite) to merge one table with another (left join) based on overlapping timestamps?


Solution

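  • If possible (get_indexer needs non-overlapping intervals), build an IntervalIndex from the starttime and endtime columns of gdf, look up the position of each timestamp with IntervalIndex.get_indexer, then pick geoid from a NumPy array by position, using None where the position is -1 (no match):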

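    import numpy as np

    # Closed intervals [starttime, endtime], one per row of gdf
    s = pd.IntervalIndex.from_arrays(gdf['starttime'], gdf['endtime'], closed='both')

    arr = gdf['geoid'].to_numpy()
    # Position of the interval containing each timestamp, or -1 if none does
    pos = s.get_indexer(df['timestamp'])

    # arr[pos] reads the last element where pos == -1; np.where replaces those with None
    df['new'] = np.where(pos != -1, arr[pos], None)
    print(df)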
        id  timestamp   new
    0  foo 2020-01-01  None
    1  foo 2020-02-01     A
    2  foo 2020-03-01     A
    3  bar 2020-04-01  None
    4  bar 2020-05-01     B
    5  bar 2020-06-01     B
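A self-contained sketch of the IntervalIndex lookup above, wrapped in a helper. The function name `label_by_interval` and the explicit overlap check are illustrative additions, not part of the original answer; the check assumes that a single label per timestamp is wanted, which only makes sense when the intervals do not overlap.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "timestamp": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                                 "2020-04-01", "2020-05-01", "2020-06-01"]),
})
gdf = pd.DataFrame({
    "geoid": ["A", "B"],
    "starttime": pd.to_datetime(["2020-01-15", "2020-04-15"]),
    "endtime": pd.to_datetime(["2020-03-15", "2020-06-15"]),
})


def label_by_interval(timestamps, starts, ends, labels):
    """Return the label of the closed interval containing each timestamp, or None."""
    intervals = pd.IntervalIndex.from_arrays(starts, ends, closed="both")
    if intervals.is_overlapping:
        # Hypothetical guard: with overlaps, one timestamp could match several labels.
        raise ValueError("intervals overlap; one label per timestamp is ambiguous")
    pos = intervals.get_indexer(timestamps)   # -1 marks "no containing interval"
    values = np.asarray(labels, dtype=object)
    return np.where(pos != -1, values[pos], None)


res = df.assign(
    geoid=label_by_interval(df["timestamp"], gdf["starttime"], gdf["endtime"], gdf["geoid"])
)
print(res)
```

Using df.assign returns a new frame and leaves df untouched, which is convenient when trying several join strategies on the same input.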

    Alternatively, use a cross join. Convert the index of df to a column with reset_index so the original index values are not lost, keep only the rows where timestamp falls between starttime and endtime using Series.between with DataFrame.loc, then turn that column back into the index with DataFrame.set_index so the new geoid column aligns with df.index:

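    # Cross join via a constant key; reset_index keeps df's index as a column
    df1 = df.reset_index().assign(a=1).merge(gdf.assign(a=1), on='a')
    # Keep only the pairs whose timestamp lies within [starttime, endtime]
    df1 = df1.loc[df1['timestamp'].between(df1['starttime'], df1['endtime']), ['index','geoid']]

    # Assignment aligns on the original index; rows without a match become NaN
    df['geoid'] = df1.set_index('index')['geoid']
    print(df)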
        id  timestamp geoid
    0  foo 2020-01-01   NaN
    1  foo 2020-02-01     A
    2  foo 2020-03-01     A
    3  bar 2020-04-01   NaN
    4  bar 2020-05-01     B
    5  bar 2020-06-01     B
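The cross-join idea can also be sketched without the helper column, assuming pandas 1.2 or later, where merge accepts how="cross". This is a variation on the answer's code rather than the original code; the names `pairs`, `matched` and `res` are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["foo", "foo", "foo", "bar", "bar", "bar"],
    "timestamp": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                                 "2020-04-01", "2020-05-01", "2020-06-01"]),
})
gdf = pd.DataFrame({
    "geoid": ["A", "B"],
    "starttime": pd.to_datetime(["2020-01-15", "2020-04-15"]),
    "endtime": pd.to_datetime(["2020-03-15", "2020-06-15"]),
})

# Every row of df paired with every row of gdf: len(df) * len(gdf) rows.
pairs = df.reset_index().merge(gdf, how="cross")

# Keep the pairs whose timestamp lies in [starttime, endtime] (bounds inclusive).
inside = pairs["timestamp"].between(pairs["starttime"], pairs["endtime"])
matched = pairs.loc[inside, ["index", "geoid"]].set_index("index")["geoid"]

# assign aligns on the index, so rows of df with no match get NaN.
res = df.assign(geoid=matched)
print(res)
```

The intermediate frame grows with the product of the two row counts, so this form suits small or moderately sized inputs; for large non-overlapping interval tables the IntervalIndex lookup avoids building that product.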