Let's say you have a dataset with the following column, Node, Time, date, Hour, Amount. Time is spaced hourly. date is simply the current date corresponding to Time. You should have only one instance of a Node per hour.
Every day, for every Hour I want the average of Amount for the past 31 days in this hour. Here is the first thing I tried:
df.rolling('Time', period='31d', group_by=['date', 'Hour']).agg(pl.col('Amount').mean())
I expected to get back one row per Hour per day, but I got the same amount of rows as the original df:
What is it calculating exactly here and how can I do what I want it to do?
Thanks!
The answer from @keraion is the closest, but unfortunately it is super slow when you have a lot of rows and also fill up 128GB of ram in 5s. 2000 nodes per hour, which gives around 17_520_000 rows per year...
Here is the solution I came up with which is fast. I calculate the mean in two steps. First group_by date and hour and calculate the sum of Amount and the count. This is only for a single day though. Then I can do a rolling of the last 31 days and calculate the mean by using the accumulated sum and count I had calculated by day:
(
df
.lazy()
.group_by('date', 'Hour')
.agg(
AmountSum = pl.col('Amount').sum(),
AmountCount = pl.len()
)
.sort('date', 'Hour')
.rolling('date', period=f'31d', group_by=['Hour'])
.agg(
(pl.col('AmountSum').sum() / pl.col('AmountCount').sum()).alias('AmountMean ')
)
.with_columns(
Time = (pl.col('date').cast(pl.Datetime('ns')) + pl.duration(hours=pl.col('Hour')))
)
.collect()
)
This gives us the mean by date by Hour for the last 31 days. This is super fast even for millions of rows. Sub 1s. I also wanted to calculate quantiles though and my trick of accumulating sums and counts per day could not work here, so instead I accumulate list of numbers and flatten it in the rolling:
(df.lazy()
.group_by('date', 'Hour')
.agg(
pl.col('Amount')
)
.sort('date', 'Hour')
.rolling('date', period=f'31d', group_by=['Hour'])
.agg(
AmountMedian = pl.col('Amount').flatten().quantile(0.5)
)
.with_columns(
Time = (pl.col('date').cast(pl.Datetime('ns')) + pl.duration(hours=pl.col('Hour')))
)
.collect()
)
This last solution is not as fast (takes ~11s), but allows me to calculate any function in the rolling.