Objective: With a timeseries of 1 second frequency, for every second, look up the values 1 min, 2 min, 3 min, 4 min, 5 min ago and sum the 5 values. For example, the value calculated for 23:59:59 would be the sum of values at 23:59:59 23:58:59 23:57:59 23:56:59 23:55:59. The value calculated for 23:59:58 would be the sum of values at 23:59:58 23:58:58 23:57:58 23:56:58 23:55:58, and so on.
What is the correct way to use polars to calculate this?
I tried gather_every(60).rolling_sum(5)
, which had ShapeError
. Then I tried to use gather_every
and rolling_sum
separately, then concat
but the dataframe shows null at the top and bottom, which means the rows were added randomly.
In out 7, values_gather_every
column should have null at the top and values at the bottom to be correct.
Please see code and outputs from notebook below:
Code 1
import polars as pl
import numpy as np
from datetime import datetime, timedelta
start = datetime(2023, 1, 1)
seconds = 86400
end = start + timedelta(seconds=seconds-1)
values = np.arange(1, seconds + 1)
df = pl.DataFrame({
'dt': pl.datetime_range(start=start, end=end, interval="1s", eager=True),
'values': values
})
print(df)
Out 1
shape: (86_400, 2)
┌─────────────────────┬────────┐
│ dt ┆ values │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪════════╡
│ 2023-01-01 00:00:00 ┆ 1 │
│ 2023-01-01 00:00:01 ┆ 2 │
│ 2023-01-01 00:00:02 ┆ 3 │
│ 2023-01-01 00:00:03 ┆ 4 │
│ … ┆ … │
│ 2023-01-01 23:59:56 ┆ 86397 │
│ 2023-01-01 23:59:57 ┆ 86398 │
│ 2023-01-01 23:59:58 ┆ 86399 │
│ 2023-01-01 23:59:59 ┆ 86400 │
└─────────────────────┴────────┘
Code 2
df = df.with_columns(pl.col("values").gather_every(60).rolling_sum(5).name.suffix("_gather_every_rolling_sum"))
Out 2
ShapeError: unable to add a column of length 1440 to a dataframe of height 86400
Code 3
gather_every_df = df.select("values").gather_every(60)
gather_every_df = gather_every_df.rename({"values": "values_gather_every"})
print(gather_every_df)
Out 3
shape: (1_440, 1)
┌─────────────────────┐
│ values_gather_every │
│ --- │
│ i64 │
╞═════════════════════╡
│ 1 │
│ 61 │
│ 121 │
│ 181 │
│ 241 │
│ … │
│ 86101 │
│ 86161 │
│ 86221 │
│ 86281 │
│ 86341 │
└─────────────────────┘
Code 4
rolling_sum_df = gather_every_df.select("values_gather_every").rolling_sum(5)
print(rolling_sum_df)
Out 4
AttributeError: 'DataFrame' object has no attribute 'rolling_sum'
Code 5
rolling_sum_series = gather_every_df.select("values_gather_every").to_series().rolling_sum(5)
print(rolling_sum_series)
Out 5
shape: (1_440,)
Series: 'values_gather_every' [i64]
[
null
null
null
null
605
905
1205
1505
1805
2105
2405
2705
…
427505
427805
428105
428405
428705
429005
429305
429605
429905
430205
430505
430805
431105
]
Code 6
rolling_sum_df = pl.DataFrame({
'values_gather_every_rolling_sum': rolling_sum_series
})
print(rolling_sum_df)
Out 6
shape: (1_440, 1)
┌─────────────────────────────────┐
│ values_gather_every_rolling_su… │
│ --- │
│ i64 │
╞═════════════════════════════════╡
│ null │
│ null │
│ null │
│ null │
│ 605 │
│ … │
│ 429905 │
│ 430205 │
│ 430505 │
│ 430805 │
│ 431105 │
└─────────────────────────────────┘
Code 7
df2 = pl.concat([df, gather_every_df, rolling_sum_df], how="horizontal")
print(df2)
print(df2.describe())
Out 7
shape: (86_400, 4)
┌─────────────────────┬────────┬─────────────────────┬─────────────────────────────────┐
│ dt ┆ values ┆ values_gather_every ┆ values_gather_every_rolling_su… │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════════╪════════╪═════════════════════╪═════════════════════════════════╡
│ 2023-01-01 00:00:00 ┆ 1 ┆ 1 ┆ null │
│ 2023-01-01 00:00:01 ┆ 2 ┆ 61 ┆ null │
│ 2023-01-01 00:00:02 ┆ 3 ┆ 121 ┆ null │
│ 2023-01-01 00:00:03 ┆ 4 ┆ 181 ┆ null │
│ 2023-01-01 00:00:04 ┆ 5 ┆ 241 ┆ 605 │
│ … ┆ … ┆ … ┆ … │
│ 2023-01-01 23:59:55 ┆ 86396 ┆ null ┆ null │
│ 2023-01-01 23:59:56 ┆ 86397 ┆ null ┆ null │
│ 2023-01-01 23:59:57 ┆ 86398 ┆ null ┆ null │
│ 2023-01-01 23:59:58 ┆ 86399 ┆ null ┆ null │
│ 2023-01-01 23:59:59 ┆ 86400 ┆ null ┆ null │
└─────────────────────┴────────┴─────────────────────┴─────────────────────────────────┘
shape: (9, 5)
┌────────────┬────────────────────────────┬──────────────┬─────────────────────┬─────────────────────────────────┐
│ statistic ┆ dt ┆ values ┆ values_gather_every ┆ values_gather_every_rolling_su… │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪════════════════════════════╪══════════════╪═════════════════════╪═════════════════════════════════╡
│ count ┆ 86400 ┆ 86400.0 ┆ 1440.0 ┆ 1436.0 │
│ null_count ┆ 0 ┆ 0.0 ┆ 84960.0 ┆ 84964.0 │
│ mean ┆ 2023-01-01 11:59:59.500000 ┆ 43200.5 ┆ 43171.0 ┆ 215855.0 │
│ std ┆ null ┆ 24941.675966 ┆ 24950.19038 ┆ 124404.541718 │
│ min ┆ 2023-01-01 00:00:00 ┆ 1.0 ┆ 1.0 ┆ 605.0 │
│ 25% ┆ 2023-01-01 06:00:00 ┆ 21601.0 ┆ 21601.0 ┆ 108305.0 │
│ 50% ┆ 2023-01-01 12:00:00 ┆ 43201.0 ┆ 43201.0 ┆ 216005.0 │
│ 75% ┆ 2023-01-01 17:59:59 ┆ 64800.0 ┆ 64741.0 ┆ 323405.0 │
│ max ┆ 2023-01-01 23:59:59 ┆ 86400.0 ┆ 86341.0 ┆ 431105.0 │
└────────────┴────────────────────────────┴──────────────┴─────────────────────┴─────────────────────────────────┘
Can you shift back 60
rows for each minute and sum them?
df.with_columns(
pl.sum_horizontal(pl.col("values").shift(60 * N) for N in range(5))
)
shape: (86_400, 3)
┌─────────────────────┬────────┬────────┐
│ dt ┆ values ┆ sum │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ i64 ┆ i64 │
╞═════════════════════╪════════╪════════╡
│ 2023-01-01 00:00:00 ┆ 1 ┆ null │
│ 2023-01-01 00:00:01 ┆ 2 ┆ null │
│ 2023-01-01 00:00:02 ┆ 3 ┆ null │
│ 2023-01-01 00:00:03 ┆ 4 ┆ null │
│ … ┆ … ┆ … │
│ 2023-01-01 23:59:56 ┆ 86397 ┆ 431385 │
│ 2023-01-01 23:59:57 ┆ 86398 ┆ 431390 │
│ 2023-01-01 23:59:58 ┆ 86399 ┆ 431395 │
│ 2023-01-01 23:59:59 ┆ 86400 ┆ 431400 │
└─────────────────────┴────────┴────────┘