Tags: python, python-polars

How to do rolling() grouped by day by hour in Polars?


Let's say you have a dataset with the following columns: Node, Time, date, Hour, Amount. Time is spaced hourly, date is simply the calendar date corresponding to Time, and there is only one row per Node per hour.

Every day, for every Hour, I want the average of Amount over the past 31 days for that hour. Here is the first thing I tried:

df.rolling('Time', period='31d', group_by=['date', 'Hour']).agg(pl.col('Amount').mean())

I expected to get back one row per Hour per day, but I got the same number of rows as the original df:


What exactly is it calculating here, and how can I make it do what I want?

Thanks!


Solution

  • The answer from @keraion is the closest, but unfortunately it is super slow when you have a lot of rows, and it filled up 128GB of RAM in 5s; with 2000 nodes per hour, that is around 17_520_000 rows per year...

    Here is the fast solution I came up with. I calculate the mean in two steps: first, group_by date and Hour and compute the sum of Amount and the row count per day; then do a rolling over the last 31 days and compute the mean from those accumulated per-day sums and counts:

    (
        df
        .lazy()
        # Per-day partial aggregates: one row per (date, Hour)
        .group_by('date', 'Hour')
        .agg(
            AmountSum = pl.col('Amount').sum(),
            AmountCount = pl.len()
        )
        .sort('date', 'Hour')
        # Trailing 31-day window per Hour: mean = sum of sums / sum of counts
        .rolling('date', period='31d', group_by=['Hour'])
        .agg(
            (pl.col('AmountSum').sum() / pl.col('AmountCount').sum()).alias('AmountMean')
        )
        .with_columns(
            Time = (pl.col('date').cast(pl.Datetime('ns')) + pl.duration(hours=pl.col('Hour')))
        )
        .collect()
    )
    

    That query gives the mean by date by Hour over the last 31 days, and it is super fast even for millions of rows: sub 1s. I also wanted to calculate quantiles, though, and the trick of accumulating sums and counts per day does not work there, so instead I accumulate the list of values per day and flatten it inside the rolling:

    (df.lazy()
        .group_by('date', 'Hour')
        .agg(
            pl.col('Amount')
        )
        .sort('date', 'Hour')
        # Flatten the window's per-day lists back into one flat series
        .rolling('date', period='31d', group_by=['Hour'])
        .agg(
            AmountMedian = pl.col('Amount').flatten().quantile(0.5)
        )
        .with_columns(
            Time = (pl.col('date').cast(pl.Datetime('ns')) + pl.duration(hours=pl.col('Hour')))
        )
        .collect()
    )
    

    This last solution is not as fast (it takes ~11s), but it lets me compute any function inside the rolling window.