Search code examples
pythonpython-3.xpython-polarsdate-arithmeticrolling-computation

How to calculate a rolling statistic in Polars, looking back from the `end_date`?


I would like to calculate a rolling "1m" statistic on a financial data time series. Given that there won't always be an equal number of rows, per month calculation, unless you have exactly enough data to divide into months, which is uncommon.

I am trying to assign a column window_index to keep track of the rows that are included in the calculation, as i will use .rolling().over() expression to calc the stat on each window. I would like the window_index integer to be added starting from the latest date, and working its way back.

Here is an image to help explain: Stock Data

Currently, what I am trying to do is add the window_index in groups denoted by the red and blue pen. Although I would like the data to be grouped by the yellow pen marking. The dataset ends at 2023-02-07 and one month prior would be 2023-01-07, or the closest value 2023-01-06.

This is the code I am using to achieve this, but I am not sure how to get the grouping window that I desire.

df_window_index = (
    data.group_by_dynamic(
        index_column="date", every="1m", by="symbol"
    )
    .agg()
    .with_columns(
        pl.int_range(0, pl.len()).over("symbol").alias("window_index")
    )
)
data = data.join_asof(df_window_index, on="date", by="symbol").sort(
    "symbol"
)

Using a timedelta instead of a string didn't seem to solve the issue.

On the left is the code above, and the image on the right is when I make every=timedelta(days=31). Is still doesnt make sense to me why Polars is pulling those dates for a 31 day timedelta.

Trying TimeDelta

Any help would be much appreciated, or any tips to point me in the right direction! Thanks!

Data:

df = pl.read_csv(b"""
date,open,high,low,close,volume,dividends,stock_splits,symbol,window_index
2021-01-04T00:00:00.000000000,133.52,133.61,126.76,129.41,143301900,0.0,0.0,AAPL,0
2021-01-05T00:00:00.000000000,128.89,131.74,128.43,131.01,97664900,0.0,0.0,AAPL,0
2021-01-06T00:00:00.000000000,127.72,131.05,126.38,126.6,155088000,0.0,0.0,AAPL,0
2021-01-07T00:00:00.000000000,128.36,131.63,127.86,130.92,109578200,0.0,0.0,AAPL,1
2021-01-08T00:00:00.000000000,132.43,132.63,130.23,132.05,105158200,0.0,0.0,AAPL,1
2021-01-11T00:00:00.000000000,129.19,130.17,128.5,128.98,100384500,0.0,0.0,AAPL,1
2021-01-12T00:00:00.000000000,128.5,129.69,126.86,128.8,91951100,0.0,0.0,AAPL,1
2021-01-13T00:00:00.000000000,128.76,131.45,128.49,130.89,88636800,0.0,0.0,AAPL,1
2021-01-14T00:00:00.000000000,130.8,131.0,128.76,128.91,90221800,0.0,0.0,AAPL,1
2021-01-15T00:00:00.000000000,128.78,130.22,127.0,127.14,111598500,0.0,0.0,AAPL,1
2021-01-19T00:00:00.000000000,127.78,128.71,126.94,127.83,90757300,0.0,0.0,AAPL,1
2021-01-20T00:00:00.000000000,128.66,132.49,128.55,132.03,104319500,0.0,0.0,AAPL,1
2021-01-21T00:00:00.000000000,133.8,139.67,133.59,136.87,120150900,0.0,0.0,AAPL,1
2021-01-22T00:00:00.000000000,136.28,139.85,135.02,139.07,114459400,0.0,0.0,AAPL,1
2021-01-25T00:00:00.000000000,143.07,145.09,136.54,142.92,157611700,0.0,0.0,AAPL,1
2021-01-26T00:00:00.000000000,143.6,144.3,141.37,143.16,98390600,0.0,0.0,AAPL,1
2021-01-27T00:00:00.000000000,143.43,144.3,140.41,142.06,140843800,0.0,0.0,AAPL,1
2021-01-28T00:00:00.000000000,139.52,141.99,136.7,137.09,142621100,0.0,0.0,AAPL,1
2021-01-29T00:00:00.000000000,135.83,136.74,130.21,131.96,177523800,0.0,0.0,AAPL,1
2021-02-01T00:00:00.000000000,133.75,135.38,130.93,134.14,106239800,0.0,0.0,AAPL,1
2021-02-02T00:00:00.000000000,135.73,136.31,134.61,134.99,83305400,0.0,0.0,AAPL,1
2021-02-03T00:00:00.000000000,135.76,135.77,133.61,133.94,89880900,0.0,0.0,AAPL,1
2021-02-04T00:00:00.000000000,136.3,137.4,134.59,137.39,84183100,0.0,0.0,AAPL,1
2021-02-05T00:00:00.000000000,137.35,137.42,135.86,136.76,75693800,0.2,0.0,AAPL,1
""".strip(), try_parse_dates=True)

Solution

  • To label non-overlapping k month time-windows starting from the latest date in each group and going backwards, you could use the following helper function.

    def create_monthly_window_index(col: str, k: int = 1):
        year_diff = pl.col(col).last().dt.year() - pl.col(col).dt.year()
        month_diff = pl.col(col).last().dt.month() - pl.col(col).dt.month()
        day_indicator = pl.col(col).dt.day() > pl.col(col).last().dt.day()
        return (12 * year_diff + month_diff - day_indicator) // k
    

    It can be used as follows.

    df.with_columns(create_monthly_window_index("date", k=1).alias("window_index").over("group"))
    

    Note that the general case of arbitrary time windows is more involved (as outlined in my comment).