python-polars

How do I include multiple parameters in rolling_apply using Polars Python?


For Polars, I can see that we can apply a function over multiple column values using pl.struct, but I can't figure out whether you can do the same thing using rolling_apply.

I am looking to see whether I can perform a rolling regression or a rolling correlation.

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "time": pl.date_range(
            start=datetime(2021, 1, 1),
            end=datetime(2022, 1, 1),
            interval="1d",
            eager=True
        ),
        "x": pl.int_range(0, 366, eager=True).shuffle(seed=1),
        "y": pl.int_range(0, 366, eager=True).shuffle(seed=2)
    }
)

If I did the following, I would get a single value:

df.select(
     pl.corr("x", "y", method="spearman").alias("corr")
)

Is it possible to get it on a rolling basis? Thanks

Edit:

I have tried to use the answer suggested below to create rolling regression:

import statsmodels.api as sm

def get_regression_parameter(args: list[pl.Series]) -> pl.Series:
    return pl.Series([sm.OLS(args[0], args[1]).fit().params[0]], dtype=pl.Float64)

df.rolling(
    index_column='time',
    period='14d',
).agg(
    pl.col('x').alias('x_list'),
    pl.col('y').alias('y_list'),
    pl.map_groups(
        exprs=["y", "x"], 
        function=get_regression_parameter).alias("corr")
)

but I am receiving the following error:

thread 'polars-7' panicked at crates/polars-python/src/map/lazy.rs:163:19:
PanicException: python function failed: unrecognized data structures: <class 'polars.series.series.Series'> / <class 'polars.series.series.Series'>

My workaround for this is:

df.rolling(
    index_column='time',
    period='14d',
).agg(
    pl.col('x').alias('x_list'),
    pl.col('y').alias('y_list'),
).with_columns(
    pl.struct('x_list', 'y_list').map_elements(lambda x: sm.OLS(x['y_list'], x['x_list']).fit().params[0]).alias('coefficient')
)

But it is outside the .agg function (which is fine - I just wanted to ask what would be considered the best way to perform this).


Solution

  • I recommend using rolling.

    df.rolling(
        index_column='time',
        period='14d',
    ).agg(
        pl.corr('x', 'y', method='spearman').alias('sp_rank'),
    )
    
    shape: (366, 2)
    ┌────────────┬───────────┐
    │ time       ┆ sp_rank   │
    │ ---        ┆ ---       │
    │ date       ┆ f64       │
    ╞════════════╪═══════════╡
    │ 2021-01-01 ┆ NaN       │
    │ 2021-01-02 ┆ -1.0      │
    │ 2021-01-03 ┆ -1.0      │
    │ 2021-01-04 ┆ -0.4      │
    │ 2021-01-05 ┆ -0.3      │
    │ …          ┆ …         │
    │ 2021-12-28 ┆ -0.006593 │
    │ 2021-12-29 ┆ -0.112088 │
    │ 2021-12-30 ┆ -0.257143 │
    │ 2021-12-31 ┆ -0.059341 │
    │ 2022-01-01 ┆ 0.032967  │
    └────────────┴───────────┘
    

    One thing I find helpful when first setting up a rolling is to produce a list of values in each period:

    df.rolling(
        index_column='time',
        period='14d',
    ).agg(
        pl.col('x').alias('x_list'),
        pl.col('y').alias('y_list'),
        pl.corr('x', 'y', method='spearman').alias('sp_rank'),
    )
    
    shape: (366, 4)
    ┌────────────┬───────────────────┬───────────────────┬───────────┐
    │ time       ┆ x_list            ┆ y_list            ┆ sp_rank   │
    │ ---        ┆ ---               ┆ ---               ┆ ---       │
    │ date       ┆ list[i64]         ┆ list[i64]         ┆ f64       │
    ╞════════════╪═══════════════════╪═══════════════════╪═══════════╡
    │ 2021-01-01 ┆ [245]             ┆ [246]             ┆ NaN       │
    │ 2021-01-02 ┆ [245, 334]        ┆ [246, 128]        ┆ -1.0      │
    │ 2021-01-03 ┆ [245, 334, 74]    ┆ [246, 128, 295]   ┆ -1.0      │
    │ 2021-01-04 ┆ [245, 334, … 52]  ┆ [246, 128, … 150] ┆ -0.4      │
    │ 2021-01-05 ┆ [245, 334, … 117] ┆ [246, 128, … 350] ┆ -0.3      │
    │ …          ┆ …                 ┆ …                 ┆ …         │
    │ 2021-12-28 ┆ [243, 285, … 172] ┆ [6, 186, … 16]    ┆ -0.006593 │
    │ 2021-12-29 ┆ [285, 81, … 124]  ┆ [186, 331, … 298] ┆ -0.112088 │
    │ 2021-12-30 ┆ [81, 174, … 66]   ┆ [331, 2, … 214]   ┆ -0.257143 │
    │ 2021-12-31 ┆ [174, 221, … 14]  ┆ [2, 208, … 82]    ┆ -0.059341 │
    │ 2022-01-01 ┆ [221, 276, … 261] ┆ [208, 40, … 275]  ┆ 0.032967  │
    └────────────┴───────────────────┴───────────────────┴───────────┘
    

    I typically only do this for a small subset at first, just to ensure that the results are correct. (For the final run, you won't want to keep the list columns.)

    Edit: using map_groups

    We can use the map_groups method inside the agg context. Let's perform an OLS using the statsmodels library.

    First, let's expand your data so that we can show how to incorporate multiple independent variables.

    from datetime import datetime
    
    import polars as pl
    import statsmodels.api as sm
    import numpy as np
    
    df = pl.DataFrame(
        {
            "time": pl.date_range(
                start=datetime(2021, 1, 1),
                end=datetime(2022, 1, 1),
                interval="1d",
                eager=True
            ),
            "y": pl.int_range(0, 366, eager=True).shuffle(seed=4),
            "x1": pl.int_range(0, 366, eager=True).shuffle(seed=1),
            "x2": pl.int_range(0, 366, eager=True).shuffle(seed=2),
            "x3": pl.int_range(0, 366, eager=True).shuffle(seed=3),
        }
    )
    df
    
    shape: (366, 5)
    ┌────────────┬─────┬─────┬─────┬─────┐
    │ time       ┆ y   ┆ x1  ┆ x2  ┆ x3  │
    │ ---        ┆ --- ┆ --- ┆ --- ┆ --- │
    │ date       ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
    ╞════════════╪═════╪═════╪═════╪═════╡
    │ 2021-01-01 ┆ 347 ┆ 245 ┆ 246 ┆ 90  │
    │ 2021-01-02 ┆ 147 ┆ 334 ┆ 128 ┆ 38  │
    │ 2021-01-03 ┆ 85  ┆ 74  ┆ 295 ┆ 24  │
    │ 2021-01-04 ┆ 36  ┆ 52  ┆ 150 ┆ 297 │
    │ 2021-01-05 ┆ 31  ┆ 117 ┆ 350 ┆ 80  │
    │ …          ┆ …   ┆ …   ┆ …   ┆ …   │
    │ 2021-12-28 ┆ 125 ┆ 172 ┆ 16  ┆ 221 │
    │ 2021-12-29 ┆ 268 ┆ 124 ┆ 298 ┆ 246 │
    │ 2021-12-30 ┆ 275 ┆ 66  ┆ 214 ┆ 58  │
    │ 2021-12-31 ┆ 69  ┆ 14  ┆ 82  ┆ 150 │
    │ 2022-01-01 ┆ 233 ┆ 261 ┆ 275 ┆ 181 │
    └────────────┴─────┴─────┴─────┴─────┘
    

    Rather than use a lambda function, I prefer to create a separate function where I can keep my calculation choices all in one place.

    def regress_it(s_list: list[pl.Series]) -> pl.Series:
        # Stack the Series column-wise: column 0 is y, the rest are x1..x3
        np_array = np.column_stack(s_list)
        y = np_array[:, 0]
        X = np_array[:, 1:]
        X = sm.add_constant(X)
        # params[0] is the coefficient on the constant (the intercept)
        coeff = sm.OLS(
            endog=y,
            exog=X,
        ).fit().params[0]
        return pl.Series([coeff], dtype=pl.Float64)
    
    df.rolling(index_column="time", period="14d").agg(
       pl.map_groups(
           exprs=["y", "x1", "x2", "x3"],
           function=regress_it
       )
       .alias('coeff')
    )
    

    The map_groups method will pass a list of Series to the called function (regress_it in this case).

    In the regress_it function, we'll first take this list of Series and column-stack them into a numpy array. We can then easily pass slices of the array to the OLS function.

    Because of the way we ordered the parameters in the map_groups call, the first column is our dependent variable y (the endog parameter in sm.OLS). The remaining columns (x1, x2, x3) will be our independent variables (the exog parameter in sm.OLS).

    I've also added a constant for the regression. Here's the output.

    shape: (366, 2)
    ┌────────────┬─────────────┐
    │ time       ┆ coeff       │
    │ ---        ┆ ---         │
    │ date       ┆ f64         │
    ╞════════════╪═════════════╡
    │ 2021-01-01 ┆ 0.66087     │
    │ 2021-01-02 ┆ 0.002035    │
    │ 2021-01-03 ┆ -0.013203   │
    │ 2021-01-04 ┆ -798.950143 │
    │ 2021-01-05 ┆ -239.766471 │
    │ …          ┆ …           │
    │ 2021-12-28 ┆ 284.286344  │
    │ 2021-12-29 ┆ 275.147184  │
    │ 2021-12-30 ┆ 285.589173  │
    │ 2021-12-31 ┆ 232.471054  │
    │ 2022-01-01 ┆ 231.976641  │
    └────────────┴─────────────┘