Tags: python, python-polars

How to do rolling() grouped by day by hour in Polars?


Let's say you have a dataset with the following columns: Node, Time, date, Hour, Amount. Time is spaced hourly, date is simply the calendar date corresponding to Time, and there is only one row per Node per hour.

Every day, for every Hour, I want the average of Amount over the past 31 days for that hour. Here is the first thing I tried:

df.rolling('Time', period='31d', group_by=['date', 'Hour']).agg(pl.col('Amount').mean())

I expected to get back one row per Hour per day, but I got the same number of rows as the original df:


What exactly is it calculating here, and how can I make it do what I want?

Thanks!


Solution

  • The answer from @keraion is the closest, but unfortunately it is super slow when you have a lot of rows, and it filled up 128GB of RAM in 5s; with 2000 nodes per hour, that is around 17_520_000 rows per year...

    Here is the fast solution I came up with. I calculate the mean in two steps: first, group_by date and Hour and compute the sum of Amount and the row count per day; then do a rolling over the last 31 days and compute the mean from those accumulated per-day sums and counts:

    (
        df
        .lazy()
        # Per-day partial aggregates: one row per (date, Hour)
        .group_by('date', 'Hour')
        .agg(
            AmountSum = pl.col('Amount').sum(),
            AmountCount = pl.len()
        )
        .sort('date', 'Hour')
        # Trailing 31-day window per Hour: mean = sum of sums / sum of counts
        .rolling('date', period='31d', group_by=['Hour'])
        .agg(
            (pl.col('AmountSum').sum() / pl.col('AmountCount').sum()).alias('AmountMean')
        )
        .with_columns(
            Time = (pl.col('date').cast(pl.Datetime('ns')) + pl.duration(hours=pl.col('Hour')))
        )
        .collect()
    )
    

    That query gives the mean by date by Hour over the last 31 days, and it is super fast even for millions of rows: sub 1s. I also wanted to calculate quantiles, though, and the trick of accumulating sums and counts per day does not work there, so instead I accumulate the list of values per day and flatten it inside the rolling:

    (df.lazy()
        .group_by('date', 'Hour')
        .agg(
            pl.col('Amount')
        )
        .sort('date', 'Hour')
        # Flatten the window's per-day lists back into one flat series
        .rolling('date', period='31d', group_by=['Hour'])
        .agg(
            AmountMedian = pl.col('Amount').flatten().quantile(0.5)
        )
        .with_columns(
            Time = (pl.col('date').cast(pl.Datetime('ns')) + pl.duration(hours=pl.col('Hour')))
        )
        .collect()
    )
    

    This last solution is not as fast (it takes ~11s), but it lets me compute any function inside the rolling window.