I have been experimenting with `pandas.DataFrame.rolling` for some time now and haven't been able to achieve the result I am looking for, so before I write a custom windowing function I figured I would ask whether I'm missing something.
I have PostgreSQL data with a composite index of `(time, node)` that has been read into a `pandas.DataFrame`, where `time` is a certain hour on a certain date. I need to create windows that contain all entries within the last two calendar dates (or any arbitrary number of days), for example, beginning at `2022-12-26 00:00:00` and ending at `2022-12-27 23:00:00`, and then perform operations on that window to return a new, resultant DataFrame. The window should then move forward an entire calendar date, which is where I am failing.
| time | node | value |
| --------------------- | ----- | ------ |
| 2022-12-26 00:00:00 | 123 | low |
| 2022-12-26 01:00:00 | 123 | med |
| 2022-12-26 02:00:00 | 123 | low |
| 2022-12-26 03:00:00 | 123 | high |
| ... | ... | ... |
| 2022-12-26 00:00:00 | 999 | low |
| 2022-12-26 01:00:00 | 999 | low |
| 2022-12-26 02:00:00 | 999 | low |
| 2022-12-26 03:00:00 | 999 | med |
| ... | ... | ... |
| 2022-12-27 00:00:00 | 123 | low |
| 2022-12-27 01:00:00 | 123 | med |
| 2022-12-27 02:00:00 | 123 | low |
| 2022-12-27 03:00:00 | 123 | high |
When I use something akin to `df.rolling(window=pd.Timedelta('2days'))`, the windows move forward hour by hour, as opposed to beginning on the next calendar date.
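A minimal sketch of this behaviour on made-up hourly data (the index range and the integer `value` column are illustrative assumptions, not the real table): a `'2D'` offset window produces one result per row, so it advances hour by hour rather than date by date.

```python
import pandas as pd

# Illustrative data: three full days of hourly observations.
index = pd.date_range("2022-12-26", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"value": range(72)}, index=index)

# Time-based window: every row gets its own window, covering the
# 48 hours that end at that row's timestamp.
counts = df["value"].rolling("2D").count()

print(len(counts) == len(df))  # one window per row, not one per calendar date
print(counts.iloc[[0, 1, 47, 48]])
```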
I've played around with `min_periods`, but it doesn't seem to work with my data, nor would it be acceptable in the long run, because the number of expected observations per window is not fixed. The `step` parameter also appears to be useless in this case, because I am using an offset rather than an integer for the window.
Is the behaviour I am looking for doable with `pandas.DataFrame.rolling`, or must I look elsewhere or write my own windowing function?
Any guidance would be appreciated. Thanks!
So from what I understand, you want to create windows of length `ndays`, and each subsequent window should start on the next day.
Given a DataFrame with an hourly index spanning five calendar dates:
import pandas as pd
import numpy as np
periods = 23 * 5
df = pd.DataFrame(
{'value': list(range(periods))},
index=pd.date_range('2022-12-16', periods=periods, freq='H')
)
d = np.random.choice(
pd.date_range('2022-12-16', periods=periods, freq='H'),
int(periods * 0.25)
)
df = df.drop(index=d)
df.head(5)
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
I randomly dropped some indices to simulate missing data.
We can use `df.resample` to group the data by day (regardless of missing data):
days = df.resample('1d')
print(days.get_group('2022-12-16'))
>>> value
2022-12-16 00:00:00 0
2022-12-16 01:00:00 1
2022-12-16 02:00:00 2
2022-12-16 04:00:00 4
2022-12-16 05:00:00 5
2022-12-16 06:00:00 6
2022-12-16 07:00:00 7
2022-12-16 08:00:00 8
2022-12-16 09:00:00 9
2022-12-16 11:00:00 11
2022-12-16 12:00:00 12
2022-12-16 13:00:00 13
2022-12-16 14:00:00 14
2022-12-16 15:00:00 15
2022-12-16 17:00:00 17
2022-12-16 18:00:00 18
2022-12-16 19:00:00 19
2022-12-16 21:00:00 21
2022-12-16 22:00:00 22
2022-12-16 23:00:00 23
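Iterating over the resampler yields `(label, frame)` pairs, one per calendar date, which is why the windowing step can pick out the frame with `w[1]`. A small sketch, assuming a made-up two-day frame with two hours dropped:

```python
import pandas as pd

index = pd.date_range("2022-12-16", periods=48, freq=pd.Timedelta(hours=1))
# Drop two hours from the first day to mimic missing observations.
df = pd.DataFrame({"value": range(48)}, index=index).drop(index=index[[3, 10]])

# Each item is a (day_label, sub_frame) tuple; gaps inside a day do not matter.
pairs = list(df.resample("1D"))
for day, frame in pairs:
    print(day.date(), len(frame))
```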
Now we only need to iterate over the days in a "sliding" manner. The package more-itertools has some helpful functions, such as `windowed`, which lets us easily control the size of the window (here with `ndays`):
from more_itertools import windowed
ndays = 2
windows = [
pd.concat([w[1] for w in window])
for window in windowed(days, ndays)
]
Printing the first and last index of each window returns:
for window in windows:
print(window.iloc[[0, -1]])
>>> value
2022-12-16 00:00:00 0
2022-12-17 23:00:00 47
value
2022-12-17 00:00:00 24
2022-12-18 23:00:00 71
value
2022-12-18 00:00:00 48
2022-12-19 23:00:00 95
value
2022-12-19 01:00:00 73
2022-12-20 18:00:00 114
Furthermore, you can set `step` in `windowed` to control the step size between windows.
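If you would rather avoid the extra dependency, the same idea can be sketched with pandas alone. This is a different technique from `windowed`: it builds the window start dates directly and selects rows with a boolean mask. The helper name `calendar_windows` and its parameters are hypothetical, and the sketch assumes a timezone-naive `DatetimeIndex` (so a day is always 24 hours) and skips trailing windows that would cover fewer than `ndays` dates.

```python
import pandas as pd


def calendar_windows(df, ndays=2, step=1):
    """Yield (start, window) pairs covering `ndays` calendar dates each.

    Hypothetical helper: consecutive windows start `step` dates apart.
    """
    first = df.index.min().normalize()  # midnight of the first date
    last = df.index.max().normalize()   # midnight of the last date
    # Last admissible start so that a window still spans `ndays` dates.
    final_start = last - pd.Timedelta(days=ndays - 1)
    starts = pd.date_range(first, final_start, freq=pd.Timedelta(days=step))
    for start in starts:
        end = start + pd.Timedelta(days=ndays)
        # Half-open interval [start, end): whole dates, whatever rows exist.
        mask = (df.index >= start) & (df.index < end)
        yield start, df[mask]


# Illustrative data: five full days of hourly observations.
index = pd.date_range("2022-12-16", periods=120, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"value": range(120)}, index=index)

for start, window in calendar_windows(df, ndays=2, step=1):
    print(start.date(), window.index[0], window.index[-1])
```

Because rows are selected by timestamp rather than by position, missing hours inside a date only make that window shorter; they do not shift its boundaries.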