Search code examples
python-polars

Correct way to use polars gather_every with rolling_sum


Objective: With a timeseries of 1 second frequency, for every second, look up the values 1 min, 2 min, 3 min, 4 min, 5 min ago and sum the 5 values. For example, the value calculated for 23:59:59 would be the sum of values at 23:59:59 23:58:59 23:57:59 23:56:59 23:55:59. The value calculated for 23:59:58 would be the sum of values at 23:59:58 23:58:58 23:57:58 23:56:58 23:55:58, and so on.

What is the correct way to use polars to calculate this?

I tried gather_every(60).rolling_sum(5), which had ShapeError. Then I tried to use gather_every and rolling_sum separately, then concat but the dataframe shows null at the top and bottom, which means the rows were added randomly. In out 7, values_gather_every column should have null at the top and values at the bottom to be correct.

Please see code and outputs from notebook below:

Code 1

import polars as pl
import numpy as np
from datetime import datetime, timedelta

start = datetime(2023, 1, 1)
seconds = 86400
end = start + timedelta(seconds=seconds-1)
values = np.arange(1, seconds + 1)

df = pl.DataFrame({
    'dt': pl.datetime_range(start=start, end=end, interval="1s", eager=True),
    'values': values
})
print(df)

Out 1

shape: (86_400, 2)
┌─────────────────────┬────────┐
│ dt                  ┆ values │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ i64    │
╞═════════════════════╪════════╡
│ 2023-01-01 00:00:00 ┆ 1      │
│ 2023-01-01 00:00:01 ┆ 2      │
│ 2023-01-01 00:00:02 ┆ 3      │
│ 2023-01-01 00:00:03 ┆ 4      │
│ …                   ┆ …      │
│ 2023-01-01 23:59:56 ┆ 86397  │
│ 2023-01-01 23:59:57 ┆ 86398  │
│ 2023-01-01 23:59:58 ┆ 86399  │
│ 2023-01-01 23:59:59 ┆ 86400  │
└─────────────────────┴────────┘

Code 2

df = df.with_columns(pl.col("values").gather_every(60).rolling_sum(5).name.suffix("_gather_every_rolling_sum"))

Out 2

ShapeError: unable to add a column of length 1440 to a dataframe of height 86400

Code 3

gather_every_df = df.select("values").gather_every(60)
gather_every_df = gather_every_df.rename({"values": "values_gather_every"})
print(gather_every_df)

Out 3

shape: (1_440, 1)
┌─────────────────────┐
│ values_gather_every │
│ ---                 │
│ i64                 │
╞═════════════════════╡
│ 1                   │
│ 61                  │
│ 121                 │
│ 181                 │
│ 241                 │
│ …                   │
│ 86101               │
│ 86161               │
│ 86221               │
│ 86281               │
│ 86341               │
└─────────────────────┘

Code 4

rolling_sum_df = gather_every_df.select("values_gather_every").rolling_sum(5)
print(rolling_sum_df)

Out 4

AttributeError: 'DataFrame' object has no attribute 'rolling_sum'

Code 5

rolling_sum_series = gather_every_df.select("values_gather_every").to_series().rolling_sum(5)
print(rolling_sum_series)

Out 5

shape: (1_440,)
Series: 'values_gather_every' [i64]
[
    null
    null
    null
    null
    605
    905
    1205
    1505
    1805
    2105
    2405
    2705
    …
    427505
    427805
    428105
    428405
    428705
    429005
    429305
    429605
    429905
    430205
    430505
    430805
    431105
]

Code 6

rolling_sum_df = pl.DataFrame({
    'values_gather_every_rolling_sum': rolling_sum_series
})
print(rolling_sum_df)

Out 6

shape: (1_440, 1)
┌─────────────────────────────────┐
│ values_gather_every_rolling_su… │
│ ---                             │
│ i64                             │
╞═════════════════════════════════╡
│ null                            │
│ null                            │
│ null                            │
│ null                            │
│ 605                             │
│ …                               │
│ 429905                          │
│ 430205                          │
│ 430505                          │
│ 430805                          │
│ 431105                          │
└─────────────────────────────────┘

Code 7

df2 = pl.concat([df, gather_every_df, rolling_sum_df], how="horizontal")
print(df2)
print(df2.describe())

Out 7

shape: (86_400, 4)
┌─────────────────────┬────────┬─────────────────────┬─────────────────────────────────┐
│ dt                  ┆ values ┆ values_gather_every ┆ values_gather_every_rolling_su… │
│ ---                 ┆ ---    ┆ ---                 ┆ ---                             │
│ datetime[μs]        ┆ i64    ┆ i64                 ┆ i64                             │
╞═════════════════════╪════════╪═════════════════════╪═════════════════════════════════╡
│ 2023-01-01 00:00:00 ┆ 1      ┆ 1                   ┆ null                            │
│ 2023-01-01 00:00:01 ┆ 2      ┆ 61                  ┆ null                            │
│ 2023-01-01 00:00:02 ┆ 3      ┆ 121                 ┆ null                            │
│ 2023-01-01 00:00:03 ┆ 4      ┆ 181                 ┆ null                            │
│ 2023-01-01 00:00:04 ┆ 5      ┆ 241                 ┆ 605                             │
│ …                   ┆ …      ┆ …                   ┆ …                               │
│ 2023-01-01 23:59:55 ┆ 86396  ┆ null                ┆ null                            │
│ 2023-01-01 23:59:56 ┆ 86397  ┆ null                ┆ null                            │
│ 2023-01-01 23:59:57 ┆ 86398  ┆ null                ┆ null                            │
│ 2023-01-01 23:59:58 ┆ 86399  ┆ null                ┆ null                            │
│ 2023-01-01 23:59:59 ┆ 86400  ┆ null                ┆ null                            │
└─────────────────────┴────────┴─────────────────────┴─────────────────────────────────┘
shape: (9, 5)
┌────────────┬────────────────────────────┬──────────────┬─────────────────────┬─────────────────────────────────┐
│ statistic  ┆ dt                         ┆ values       ┆ values_gather_every ┆ values_gather_every_rolling_su… │
│ ---        ┆ ---                        ┆ ---          ┆ ---                 ┆ ---                             │
│ str        ┆ str                        ┆ f64          ┆ f64                 ┆ f64                             │
╞════════════╪════════════════════════════╪══════════════╪═════════════════════╪═════════════════════════════════╡
│ count      ┆ 86400                      ┆ 86400.0      ┆ 1440.0              ┆ 1436.0                          │
│ null_count ┆ 0                          ┆ 0.0          ┆ 84960.0             ┆ 84964.0                         │
│ mean       ┆ 2023-01-01 11:59:59.500000 ┆ 43200.5      ┆ 43171.0             ┆ 215855.0                        │
│ std        ┆ null                       ┆ 24941.675966 ┆ 24950.19038         ┆ 124404.541718                   │
│ min        ┆ 2023-01-01 00:00:00        ┆ 1.0          ┆ 1.0                 ┆ 605.0                           │
│ 25%        ┆ 2023-01-01 06:00:00        ┆ 21601.0      ┆ 21601.0             ┆ 108305.0                        │
│ 50%        ┆ 2023-01-01 12:00:00        ┆ 43201.0      ┆ 43201.0             ┆ 216005.0                        │
│ 75%        ┆ 2023-01-01 17:59:59        ┆ 64800.0      ┆ 64741.0             ┆ 323405.0                        │
│ max        ┆ 2023-01-01 23:59:59        ┆ 86400.0      ┆ 86341.0             ┆ 431105.0                        │
└────────────┴────────────────────────────┴──────────────┴─────────────────────┴─────────────────────────────────┘

Solution

  • Can you shift back 60 rows for each minute and sum them?

    df.with_columns(
        pl.sum_horizontal(pl.col("values").shift(60 * N) for N in range(5))
    )
    
    shape: (86_400, 3)
    ┌─────────────────────┬────────┬────────┐
    │ dt                  ┆ values ┆ sum    │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ i64    ┆ i64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2023-01-01 00:00:00 ┆ 1      ┆ null   │
    │ 2023-01-01 00:00:01 ┆ 2      ┆ null   │
    │ 2023-01-01 00:00:02 ┆ 3      ┆ null   │
    │ 2023-01-01 00:00:03 ┆ 4      ┆ null   │
    │ …                   ┆ …      ┆ …      │
    │ 2023-01-01 23:59:56 ┆ 86397  ┆ 431385 │
    │ 2023-01-01 23:59:57 ┆ 86398  ┆ 431390 │
    │ 2023-01-01 23:59:58 ┆ 86399  ┆ 431395 │
    │ 2023-01-01 23:59:59 ┆ 86400  ┆ 431400 │
    └─────────────────────┴────────┴────────┘