Search code examples
pythonpandasdatetimemeancentering

Rolling average on pandas data frame where the center of mean is based on time from another data frame


I have one df with data roughly every 15 seconds (df1) and another roughly every 5 minutes (df2) that are listed below. I would like to have the "time" variable of df2 matched to the closest timestamp of df1, then have that closest timestamp as the center of 5-minute averaged df1 data.

df1:

                  time  speed
0  2022-10-04 00:00:24  4.590
1  2022-10-04 00:00:41  4.389
2  2022-10-04 00:00:57  4.367
3  2022-10-04 00:01:14  4.539
4  2022-10-04 00:01:30  4.584
5  2022-10-04 00:01:48  4.523
6  2022-10-04 00:02:05  4.498
7  2022-10-04 00:02:21  4.625
8  2022-10-04 00:02:38  4.497
9  2022-10-04 00:02:54  4.406
10 2022-10-04 00:03:12  4.502
11 2022-10-04 00:03:28  4.494
12 2022-10-04 00:03:45  4.445
13 2022-10-04 00:04:01  4.438
14 2022-10-04 00:04:18  4.433
15 2022-10-04 00:04:36  4.441
16 2022-10-04 00:04:52  4.400
17 2022-10-04 00:05:09  4.221
18 2022-10-04 00:05:27  4.115
19 2022-10-04 00:05:43  4.009
20 2022-10-04 00:06:01  4.230
21 2022-10-04 00:06:18  4.360
22 2022-10-04 00:06:34  4.331
23 2022-10-04 00:06:51  4.178
24 2022-10-04 00:07:07  4.238
25 2022-10-04 00:07:25  4.125
26 2022-10-04 00:07:43  3.988
27 2022-10-04 00:08:17  3.573
28 2022-10-04 00:08:34  4.471
29 2022-10-04 00:08:50  4.567
30 2022-10-04 00:09:08  4.451
31 2022-10-04 00:09:25  4.311
32 2022-10-04 00:09:42  4.280
33 2022-10-04 00:09:59  4.439
34 2022-10-04 00:10:17  4.410
35 2022-10-04 00:10:35  4.335
36 2022-10-04 00:10:51  4.193
37 2022-10-04 00:11:08  4.140
38 2022-10-04 00:11:25  4.020
39 2022-10-04 00:11:43  3.872
40 2022-10-04 00:12:01  3.859
41 2022-10-04 00:12:17  4.062
42 2022-10-04 00:12:34  3.861
43 2022-10-04 00:12:51  3.780
44 2022-10-04 00:13:07  3.680
45 2022-10-04 00:13:25  3.909
46 2022-10-04 00:13:42  3.852
47 2022-10-04 00:13:58  3.867
48 2022-10-04 00:14:15  3.715
49 2022-10-04 00:14:32  3.534
50 2022-10-04 00:14:49  3.349
51 2022-10-04 00:15:06  3.213
52 2022-10-04 00:15:23  3.215
53 2022-10-04 00:15:39  3.246
54 2022-10-04 00:15:55  3.195
55 2022-10-04 00:16:14  3.164
56 2022-10-04 00:16:30  3.149
57 2022-10-04 00:16:47  3.281
58 2022-10-04 00:17:03  3.366
59 2022-10-04 00:17:20  3.295
60 2022-10-04 00:17:38  3.487
61 2022-10-04 00:17:54  3.534
62 2022-10-04 00:18:11  3.430
63 2022-10-04 00:18:27  3.474
64 2022-10-04 00:18:44  3.275
65 2022-10-04 00:19:01  3.584
66 2022-10-04 00:19:18  3.616
67 2022-10-04 00:19:34  3.506
68 2022-10-04 00:19:51  3.561
69 2022-10-04 00:20:08  3.316
70 2022-10-04 00:20:27  3.396
71 2022-10-04 00:20:43  3.536
72 2022-10-04 00:20:59  3.631
73 2022-10-04 00:21:16  3.573
74 2022-10-04 00:21:33  3.514
75 2022-10-04 00:21:50  3.603
76 2022-10-04 00:22:07  3.591
77 2022-10-04 00:22:23  3.591
78 2022-10-04 00:22:40  3.659
79 2022-10-04 00:23:14  4.056

df2:

                 time     speed
0 2022-10-03 00:03:23  4.646689
1 2022-10-03 00:08:24  5.328516
2 2022-10-03 00:13:24  5.895778
3 2022-10-03 00:18:24  5.665014
4 2022-10-03 00:22:25  6.313763

What I know is that I can use pandas.merge_asof to align the times to be as close as possible with the 'tolerance' parameter. From there, I've been trying several combinations of pandas.groupby and pandas.rolling and am still struggling to get the desired result. I'm not sure how to proceed, so any help would be greatly appreciated.


Solution

  • IIUC, you can try something like this:

    df1_rolling_mean = (
        df1.rolling(window="300s", on="time", center=True)
        .mean()
        .rename(columns={"speed": "speed_avg"})
    )
    df1 = pd.merge(df1, df1_rolling_mean)
    
    df = pd.merge_asof(
        df1,
        df2.rename(columns={"speed": "speed_df2"}),
        on="time",
        tolerance=pd.Timedelta(seconds=15),
    )
    
    df = df[df.loc[:, "speed_df2"].notna()]
    

    This will result in:

                      time  speed  speed_avg  speed_df2
    11 2022-10-04 00:03:28  4.494   4.421765   4.646689
    28 2022-10-04 00:08:34  4.471   4.265625   5.328516
    45 2022-10-04 00:13:25  3.909   3.687167   5.895778
    63 2022-10-04 00:18:27  3.474   3.410000   5.665014
    78 2022-10-04 00:22:40  3.659   3.615000   6.313763
    
    • speed_avg is the 5-min average of column speed from df1;
    • We keep only the columns where df1 and df2 had common values for time when merging with merge_asof on column time, with a tolerance of 15 seconds.