I have one df with data roughly every 15 seconds (df1) and another roughly every 5 minutes (df2) that are listed below. I would like to have the "time" variable of df2 matched to the closest timestamp of df1, then have that closest timestamp as the center of 5-minute averaged df1 data.
df1:
time speed
0 2022-10-04 00:00:24 4.590
1 2022-10-04 00:00:41 4.389
2 2022-10-04 00:00:57 4.367
3 2022-10-04 00:01:14 4.539
4 2022-10-04 00:01:30 4.584
5 2022-10-04 00:01:48 4.523
6 2022-10-04 00:02:05 4.498
7 2022-10-04 00:02:21 4.625
8 2022-10-04 00:02:38 4.497
9 2022-10-04 00:02:54 4.406
10 2022-10-04 00:03:12 4.502
11 2022-10-04 00:03:28 4.494
12 2022-10-04 00:03:45 4.445
13 2022-10-04 00:04:01 4.438
14 2022-10-04 00:04:18 4.433
15 2022-10-04 00:04:36 4.441
16 2022-10-04 00:04:52 4.400
17 2022-10-04 00:05:09 4.221
18 2022-10-04 00:05:27 4.115
19 2022-10-04 00:05:43 4.009
20 2022-10-04 00:06:01 4.230
21 2022-10-04 00:06:18 4.360
22 2022-10-04 00:06:34 4.331
23 2022-10-04 00:06:51 4.178
24 2022-10-04 00:07:07 4.238
25 2022-10-04 00:07:25 4.125
26 2022-10-04 00:07:43 3.988
27 2022-10-04 00:08:17 3.573
28 2022-10-04 00:08:34 4.471
29 2022-10-04 00:08:50 4.567
30 2022-10-04 00:09:08 4.451
31 2022-10-04 00:09:25 4.311
32 2022-10-04 00:09:42 4.280
33 2022-10-04 00:09:59 4.439
34 2022-10-04 00:10:17 4.410
35 2022-10-04 00:10:35 4.335
36 2022-10-04 00:10:51 4.193
37 2022-10-04 00:11:08 4.140
38 2022-10-04 00:11:25 4.020
39 2022-10-04 00:11:43 3.872
40 2022-10-04 00:12:01 3.859
41 2022-10-04 00:12:17 4.062
42 2022-10-04 00:12:34 3.861
43 2022-10-04 00:12:51 3.780
44 2022-10-04 00:13:07 3.680
45 2022-10-04 00:13:25 3.909
46 2022-10-04 00:13:42 3.852
47 2022-10-04 00:13:58 3.867
48 2022-10-04 00:14:15 3.715
49 2022-10-04 00:14:32 3.534
50 2022-10-04 00:14:49 3.349
51 2022-10-04 00:15:06 3.213
52 2022-10-04 00:15:23 3.215
53 2022-10-04 00:15:39 3.246
54 2022-10-04 00:15:55 3.195
55 2022-10-04 00:16:14 3.164
56 2022-10-04 00:16:30 3.149
57 2022-10-04 00:16:47 3.281
58 2022-10-04 00:17:03 3.366
59 2022-10-04 00:17:20 3.295
60 2022-10-04 00:17:38 3.487
61 2022-10-04 00:17:54 3.534
62 2022-10-04 00:18:11 3.430
63 2022-10-04 00:18:27 3.474
64 2022-10-04 00:18:44 3.275
65 2022-10-04 00:19:01 3.584
66 2022-10-04 00:19:18 3.616
67 2022-10-04 00:19:34 3.506
68 2022-10-04 00:19:51 3.561
69 2022-10-04 00:20:08 3.316
70 2022-10-04 00:20:27 3.396
71 2022-10-04 00:20:43 3.536
72 2022-10-04 00:20:59 3.631
73 2022-10-04 00:21:16 3.573
74 2022-10-04 00:21:33 3.514
75 2022-10-04 00:21:50 3.603
76 2022-10-04 00:22:07 3.591
77 2022-10-04 00:22:23 3.591
78 2022-10-04 00:22:40 3.659
79 2022-10-04 00:23:14 4.056
df2:
time speed
0 2022-10-03 00:03:23 4.646689
1 2022-10-03 00:08:24 5.328516
2 2022-10-03 00:13:24 5.895778
3 2022-10-03 00:18:24 5.665014
4 2022-10-03 00:22:25 6.313763
What I know is that I can use pandas.merge_asof to align the times to be as close as possible with the 'tolerance' parameter. From there, I've been trying several combinations of pandas.groupby and pandas.rolling and am still struggling to get the desired result. I'm not sure how to proceed, so any help would be greatly appreciated.
IIUC, you can try something like this:
df1_rolling_mean = (
df1.rolling(window="300s", on="time", center=True)
.mean()
.rename(columns={"speed": "speed_avg"})
)
df1 = pd.merge(df1, df1_rolling_mean)
df = pd.merge_asof(
df1,
df2.rename(columns={"speed": "speed_df2"}),
on="time",
tolerance=pd.Timedelta(seconds=15),
)
df = df[df.loc[:, "speed_df2"].notna()]
This will result in:
time speed speed_avg speed_df2
11 2022-10-04 00:03:28 4.494 4.421765 4.646689
28 2022-10-04 00:08:34 4.471 4.265625 5.328516
45 2022-10-04 00:13:25 3.909 3.687167 5.895778
63 2022-10-04 00:18:27 3.474 3.410000 5.665014
78 2022-10-04 00:22:40 3.659 3.615000 6.313763
speed_avg
is the 5-min average of column speed
from df1
;df1
and df2
had common values for time
when merging with merge_asof
on column time
, with a tolerance of 15 seconds.