Search code examples
pythonpython-polars

Polars dataframe: rolling_sum look ahead


I want to calculate rolling_sum, but not over x rows above the current row, but over the x rows below the current row.

My solution is to sort the dataframe with descending=True before applying the rolling_sum and sort back to descending=False.

My solution:

import polars as pl

# Dummy dataset
df = pl.DataFrame({
        "Date": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        "Close": [-1, 1, 2, 3, 4, 4, 3, 2, 1, -1],
        "Company": ["A", "A", "A","A", "A",  "B", "B", "B", "B", "B"]
    })

# Solution using sort twice

(
    df
    .sort(by=["Company", "Date"], descending=[True, True])
    .with_columns(
        pl.col("Close").rolling_sum(3).over("Company").alias("Cumsum_lead")
    )
    .sort(by=["Company", "Date"], descending=[False, False])
)

Is there a better solution?

With better I mean:

  • more computational efficient and/or
  • less code / easier to read

Thanks!

EDIT:

I just thought of one other solution which is avoids sorting / reversing the column altogether: using shift

(
    df
    .with_columns(
        pl.col("Close")
      .rolling_sum(3)
      .shift(-2)
      .over("Company").alias("Cumsum_lead"))
)

Solution

  • You can avoid sorting the rows and instead reverse the specific column twice using pl.Expr.reverse.

    (
        df
        .with_columns(
            pl.col("Close")
            .reverse().rolling_sum(3).reverse()
            .over("Company").alias("Cumsum_lead")
        )
    )
    

    For readability, this could also be wrapped into a helper function.

    def rolling_sum_lead(expr: pl.Expr, window_size: int) -> pl.Expr:
        return expr.reverse().rolling_sum(window_size).reverse()
    
    (
        df
        .with_columns(
            rolling_sum_lead(pl.col("Close"), 3).over("Company").alias("Cumsum_lead")
        )
    )
    

    Note. On my machine, this takes 124 µs ± 5.67 µs per loop in contrast to 205 µs ± 6.9 µs per loop for the solution using pl.DataFrame.sort.