I have a DataFrame df with ~750k rows, 15 columns, and a pd.Timestamp index called ts. I process real-time data down to milliseconds in near real time. Now I would like to attach some statistical data, which lives in df_stats at a coarser time resolution, to the big df as new columns. df_stats has a time resolution of 1 minute.
$ df
+----------------+---+---------+
| ts | A | new_col |
+----------------+---+---------+
| 11:33:11.31234 | 1 | 81 |
+----------------+---+---------+
| 11:33:11.64257 | 2 | 81 |
+----------------+---+---------+
| 11:34:10.12345 | 3 | 60 |
+----------------+---+---------+
$ df_stats
+----------------+----------------+
| ts | new_col_source |
+----------------+----------------+
| 11:33:00.00000 | 81 |
+----------------+----------------+
| 11:34:00.00000 | 60 |
+----------------+----------------+
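For reference, a minimal reproduction of the two frames above (the values and the date 2021-01-01 are hypothetical; the tables only show the time of day, but the real data are full pd.Timestamps):

import pandas as pd

# df: millisecond-resolution rows, ts as the index
df = pd.DataFrame(
    {'A': [1, 2, 3]},
    index=pd.to_datetime(['2021-01-01 11:33:11.31234',
                          '2021-01-01 11:33:11.64257',
                          '2021-01-01 11:34:10.12345']).rename('ts'))

# df_stats: one row per minute, ts as a regular column
df_stats = pd.DataFrame(
    {'ts': pd.to_datetime(['2021-01-01 11:33:00', '2021-01-01 11:34:00']),
     'new_col_source': [81, 60]})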
Currently I have the code below, but it is inefficient because it needs to iterate over the complete data. I am wondering if there isn't an easier solution using pd.cut, binning, or pd.Grouper? Or something else to merge the time buckets on the two indexes?
df_stats['ts_timeonly'] = df_stats['ts'].map(lambda x: x.replace(second=0, microsecond=0))
df['ts_timeonly'] = df.index.map(lambda x: x.replace(second=0, microsecond=0))
df = df.reset_index().merge(df_stats, on='ts_timeonly', how='left', sort=True, suffixes=['', '_hist']).set_index('ts')
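One way to drop the per-row Python lambda is to truncate the timestamps with the vectorized .floor('min') and merge on that. A sketch of the same left join, assuming ts is the DatetimeIndex of df and a regular column in df_stats as in the tables above:

# floor() runs vectorized in C instead of calling a lambda once per row
df['ts_timeonly'] = df.index.floor('min')
df_stats['ts_timeonly'] = df_stats['ts'].dt.floor('min')
df = (df.reset_index()
        .merge(df_stats.drop(columns='ts'), on='ts_timeonly', how='left')
        .set_index('ts'))

This still performs a full merge, but the minute truncation itself is no longer the bottleneck.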
Let us try something new: reindex.
df_stats = df_stats.set_index('ts').reindex(df.index, method='nearest')
df_stats.index = df.index
df = pd.concat([df, df_stats], axis=1)
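Note that method='nearest' can pull a row forward to the next minute (e.g. 11:33:40 would pick up the 11:34:00 stats). If the intent is the truncate-to-minute semantics of the original merge, method='ffill' reproduces it, assuming df_stats is sorted by ts:

# each row takes the stats of the minute it falls into
df_stats = df_stats.set_index('ts').reindex(df.index, method='ffill')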
Or
df = pd.merge_asof(df.reset_index(), df_stats, on='ts', direction='nearest').set_index('ts')
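merge_asof requires both inputs to be sorted on the key. As with reindex, direction='nearest' matches whichever minute stamp is closest; if each row should instead take the stats of the minute it falls into, the default direction='backward' is the equivalent of the floor-based merge:

# 'backward' takes the last df_stats row whose ts is <= the row's ts
df = pd.merge_asof(df.reset_index(), df_stats, on='ts', direction='backward').set_index('ts')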