I want to calculate rolling_sum
, but not over x rows above the current row, but over the x rows below the current row.
My solution is to sort the dataframe with descending=True
before applying the rolling_sum
and sort back to descending=False
.
My solution:
import polars as pl
# Dummy dataset
df = pl.DataFrame({
"Date": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
"Close": [-1, 1, 2, 3, 4, 4, 3, 2, 1, -1],
"Company": ["A", "A", "A","A", "A", "B", "B", "B", "B", "B"]
})
# Solution using sort twice
(
df
.sort(by=["Company", "Date"], descending=[True, True])
.with_columns(
pl.col("Close").rolling_sum(3).over("Company").alias("Cumsum_lead")
)
.sort(by=["Company", "Date"], descending=[False, False])
)
Is there a better solution?
With better I mean:
Thanks!
EDIT:
I just thought of one other solution which is avoids sorting / reversing the column altogether: using shift
(
df
.with_columns(
pl.col("Close")
.rolling_sum(3)
.shift(-2)
.over("Company").alias("Cumsum_lead"))
)
You can avoid sorting the rows and instead reverse the specific column twice using pl.Expr.reverse
.
(
df
.with_columns(
pl.col("Close")
.reverse().rolling_sum(3).reverse()
.over("Company").alias("Cumsum_lead")
)
)
For readability, this could also be wrapped into a helper function.
def rolling_sum_lead(expr: pl.Expr, window_size: int) -> pl.Expr:
return expr.reverse().rolling_sum(window_size).reverse()
(
df
.with_columns(
rolling_sum_lead(pl.col("Close"), 3).over("Company").alias("Cumsum_lead")
)
)
Note. On my machine, this takes 124 µs ± 5.67 µs per loop in contrast to 205 µs ± 6.9 µs per loop for the solution using pl.DataFrame.sort
.