Search code examples
pandasdataframemergepandas-groupbypandas-apply

Pandas merge two time series dataframes based on time window (cut/bin/merge)


Having a 750k rows df with 15 columns and a pd.Timestamp as index called ts. I process realtime data down to milliseconds in near-realtime.

Now I would like to apply some statistical data derived from a higher time resolution in df_stats as new columns to the big df. The df_stats has a time resolution of 1 minute.

$ df
+----------------+---+---------+
| ts             | A | new_col |
+----------------+---+---------+
| 11:33:11.31234 | 1 | 81      |
+----------------+---+---------+
| 11:33:11.64257 | 2 | 81      |
+----------------+---+---------+
| 11:34:10.12345 | 3 | 60      |
+----------------+---+---------+
$ df_stats
+----------------+----------------+
| ts             | new_col_source |
+----------------+----------------+
| 11:33:00.00000 | 81             |
+----------------+----------------+
| 11:34:00.00000 | 60             |
+----------------+----------------+

Currently I have the code below, but it is inefficient, because it nees to iterate over the complete data.

I am wondering if there couldnt be an easier solution using pd.cut, bin or pd.Grouper? Or something else to merge the time-buckets on the two indexes?

df_stats['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df = df.merge(df_stats, on='ts_timeonly', how='left', sort=True, suffixes=['', '_hist']).set_index('ts')

Solution

  • Let us try something new reindex

    df_stats=df_stats.set_index('ts').reindex(df['ts'], method='nearest')
    df_stats.index=df.index
    
    df=pd.concat([df,df_stats],axis=1)
    

    Or

    df=pd.merge_asof(df, df_stats, on='ts',direction='nearest')