Search code examples
python-polars

Correct way to use polars take_every with rolling_sum


Objective: With a timeseries of 1 second frequency, for every second, look up the values 1 min, 2 min, 3 min, 4 min, 5 min ago and sum the 5 values. For example, the value calculated for 23:59:59 would be the sum of values at 23:59:59 23:58:59 23:57:59 23:56:59 23:55:59. The value calculated for 23:59:58 would be the sum of values at 23:59:58 23:58:58 23:57:58 23:56:58 23:55:58, and so on.

What is the correct way to use polars to calculate this?

I tried take_every(60).rolling_sum(5), which had ShapeError. Then I tried to use take_every and rolling_sum separately, then concat but the dataframe shows null at the top and bottom, which means the rows were added randomly. In out 7, values_take_every column should have null at the top and values at the bottom to be correct.

Please see code and outputs from notebook below:

Code 1

import polars as pl
import numpy as np
from datetime import datetime, timedelta

start = datetime(2023, 1, 1)
seconds = 86400
end = start + timedelta(seconds=seconds-1)
values = np.arange(1, seconds + 1)

df = pl.DataFrame({
    'dt': pl.date_range(start=start, end=end, interval="1s", eager=True),
    'values': values
})
print(df)

Out 1

shape: (86_400, 2)
┌─────────────────────┬────────┐
│ dt                  ┆ values │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ i64    │
╞═════════════════════╪════════╡
│ 2023-01-01 00:00:00 ┆ 1      │
│ 2023-01-01 00:00:01 ┆ 2      │
│ 2023-01-01 00:00:02 ┆ 3      │
│ 2023-01-01 00:00:03 ┆ 4      │
│ …                   ┆ …      │
│ 2023-01-01 23:59:56 ┆ 86397  │
│ 2023-01-01 23:59:57 ┆ 86398  │
│ 2023-01-01 23:59:58 ┆ 86399  │
│ 2023-01-01 23:59:59 ┆ 86400  │
└─────────────────────┴────────┘

Code 2

df = df.with_columns(pl.col("values").take_every(60).rolling_sum(5).suffix("_take_every_rolling_sum"))

Out 2

ShapeError: unable to add a column of length 1440 to a dataframe of height 86400

Code 3

take_every_df = df.select("values").take_every(60)
take_every_df = take_every_df.rename({"values": "values_take_every"})
print(take_every_df)

Out 3

shape: (1_440, 1)
┌───────────────────┐
│ values_take_every │
│ ---               │
│ i64               │
╞═══════════════════╡
│ 1                 │
│ 61                │
│ 121               │
│ 181               │
│ …                 │
│ 86161             │
│ 86221             │
│ 86281             │
│ 86341             │
└───────────────────┘

Code 4

rolling_sum_df = take_every_df.select("values_take_every").rolling_sum(5)
print(rolling_sum_df)

Out 4

AttributeError: 'DataFrame' object has no attribute 'rolling_sum'

Code 5

rolling_sum_series = take_every_df.select("values_take_every").to_series().rolling_sum(5)
print(rolling_sum_series)

Out 5

shape: (1_440,)
Series: 'values_take_every' [i64]
[
    null
    null
    null
    null
    605
    905
    1205
    1505
    1805
    2105
    2405
    2705
    …
    427505
    427805
    428105
    428405
    428705
    429005
    429305
    429605
    429905
    430205
    430505
    430805
    431105
]

Code 6

rolling_sum_df = pl.DataFrame({
    'values_take_every_rolling_sum': rolling_sum_series
})
print(rolling_sum_df)

Out 6

shape: (1_440, 1)
┌───────────────────────────────┐
│ values_take_every_rolling_sum │
│ ---                           │
│ i64                           │
╞═══════════════════════════════╡
│ null                          │
│ null                          │
│ null                          │
│ null                          │
│ …                             │
│ 430205                        │
│ 430505                        │
│ 430805                        │
│ 431105                        │
└───────────────────────────────┘

Code 7

df2 = pl.concat([df, take_every_df, rolling_sum_df], how="horizontal")
print(df2)
print(df2.describe())

Out 7

shape: (86_400, 4)
┌─────────────────────┬────────┬───────────────────┬───────────────────────────────┐
│ dt                  ┆ values ┆ values_take_every ┆ values_take_every_rolling_sum │
│ ---                 ┆ ---    ┆ ---               ┆ ---                           │
│ datetime[μs]        ┆ i64    ┆ i64               ┆ i64                           │
╞═════════════════════╪════════╪═══════════════════╪═══════════════════════════════╡
│ 2023-01-01 00:00:00 ┆ 1      ┆ 1                 ┆ null                          │
│ 2023-01-01 00:00:01 ┆ 2      ┆ 61                ┆ null                          │
│ 2023-01-01 00:00:02 ┆ 3      ┆ 121               ┆ null                          │
│ 2023-01-01 00:00:03 ┆ 4      ┆ 181               ┆ null                          │
│ …                   ┆ …      ┆ …                 ┆ …                             │
│ 2023-01-01 23:59:56 ┆ 86397  ┆ null              ┆ null                          │
│ 2023-01-01 23:59:57 ┆ 86398  ┆ null              ┆ null                          │
│ 2023-01-01 23:59:58 ┆ 86399  ┆ null              ┆ null                          │
│ 2023-01-01 23:59:59 ┆ 86400  ┆ null              ┆ null                          │
└─────────────────────┴────────┴───────────────────┴───────────────────────────────┘
shape: (9, 5)
┌────────────┬─────────────────────┬──────────────┬───────────────────┬────────────────────────────┐
│ describe   ┆ dt                  ┆ values       ┆ values_take_every ┆ values_take_every_rolling_ │
│ ---        ┆ ---                 ┆ ---          ┆ ---               ┆ sum                        │
│ str        ┆ str                 ┆ f64          ┆ f64               ┆ ---                        │
│            ┆                     ┆              ┆                   ┆ f64                        │
╞════════════╪═════════════════════╪══════════════╪═══════════════════╪════════════════════════════╡
│ count      ┆ 86400               ┆ 86400.0      ┆ 86400.0           ┆ 86400.0                    │
│ null_count ┆ 0                   ┆ 0.0          ┆ 84960.0           ┆ 84964.0                    │
│ mean       ┆ null                ┆ 43200.5      ┆ 43171.0           ┆ 215855.0                   │
│ std        ┆ null                ┆ 24941.675966 ┆ 24950.19038       ┆ 124404.541718              │
│ min        ┆ 2023-01-01 00:00:00 ┆ 1.0          ┆ 1.0               ┆ 605.0                      │
│ max        ┆ 2023-01-01 23:59:59 ┆ 86400.0      ┆ 86341.0           ┆ 431105.0                   │
│ median     ┆ null                ┆ 43200.5      ┆ 43171.0           ┆ 215855.0                   │
│ 25%        ┆ null                ┆ 21601.0      ┆ 21601.0           ┆ 108305.0                   │
│ 75%        ┆ null                ┆ 64801.0      ┆ 64801.0           ┆ 323705.0                   │
└────────────┴─────────────────────┴──────────────┴───────────────────┴────────────────────────────┘

Solution

  • Can you shift back 60 rows for each minute and sum them?

    df.with_columns(
        pl.sum_horizontal(pl.col("values").shift(60 * N) for N in range(5))
    )
    
    shape: (86_400, 3)
    ┌─────────────────────┬────────┬────────┐
    │ dt                  ┆ values ┆ sum    │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ i64    ┆ i64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2023-01-01 00:00:00 ┆ 1      ┆ null   │
    │ 2023-01-01 00:00:01 ┆ 2      ┆ null   │
    │ 2023-01-01 00:00:02 ┆ 3      ┆ null   │
    │ 2023-01-01 00:00:03 ┆ 4      ┆ null   │
    │ …                   ┆ …      ┆ …      │
    │ 2023-01-01 23:59:56 ┆ 86397  ┆ 431385 │
    │ 2023-01-01 23:59:57 ┆ 86398  ┆ 431390 │
    │ 2023-01-01 23:59:58 ┆ 86399  ┆ 431395 │
    │ 2023-01-01 23:59:59 ┆ 86400  ┆ 431400 │
    └─────────────────────┴────────┴────────┘