For Polars, I can see that we can apply with multiple column values using pl.struct here, but I can't figure out whether you can do the same thing using rolling_apply.
I am looking to see whether I can perform a rolling regression or a rolling correlation.
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"time": pl.date_range(
start=datetime(2021, 1, 1),
end=datetime(2022, 1, 1),
interval="1d",
eager=True
),
"x": pl.int_range(0, 366, eager=True).shuffle(seed=1),
"y": pl.int_range(0, 366, eager=True).shuffle(seed=2)
}
)
If I did the following, I would get a single value
df.select(
pl.corr("x", "y", method="spearman").alias("corr")
)
Is it possible to get it on a rolling basis? Thanks
Edit:
I have tried to use the answer suggested below to create a rolling regression:
def get_regression_parameter(args: List[pl.Series]) -> pl.Series:
    return pl.Series([sm.OLS(args[0], args[1]).fit().params[0]], dtype=pl.Float64)
df.rolling(
index_column='time',
period='14d',
).agg(
pl.col('x').alias('x_list'),
pl.col('y').alias('y_list'),
pl.map_groups(
exprs=["y", "x"],
function=get_regression_parameter).alias("corr")
)
but I am receiving the following error:
thread 'polars-7' panicked at crates/polars-python/src/map/lazy.rs:163:19:
PanicException: python function failed: unrecognized data structures: <class 'polars.series.series.Series'> / <class 'polars.series.series.Series'>
My workaround for this is:
df.rolling(
index_column='time',
period='14d',
).agg(
pl.col('x').alias('x_list'),
pl.col('y').alias('y_list'),
).with_columns(
pl.struct('x_list', 'y_list').map_elements(lambda x: sm.OLS(x['y_list'], x['x_list']).fit().params[0]).alias('coefficient')
)
But it is outside the .agg function (which is fine; I just wanted to ask what would be considered the best way to perform this).
I recommend using rolling.
df.rolling(
index_column='time',
period='14d',
).agg(
pl.corr('x', 'y', method='spearman').alias('sp_rank'),
)
shape: (366, 2)
┌────────────┬───────────┐
│ time ┆ sp_rank │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═══════════╡
│ 2021-01-01 ┆ NaN │
│ 2021-01-02 ┆ -1.0 │
│ 2021-01-03 ┆ -1.0 │
│ 2021-01-04 ┆ -0.4 │
│ 2021-01-05 ┆ -0.3 │
│ … ┆ … │
│ 2021-12-28 ┆ -0.006593 │
│ 2021-12-29 ┆ -0.112088 │
│ 2021-12-30 ┆ -0.257143 │
│ 2021-12-31 ┆ -0.059341 │
│ 2022-01-01 ┆ 0.032967 │
└────────────┴───────────┘
One thing I find helpful when first setting up a rolling call is to produce a list of the values in each period:
df.rolling(
index_column='time',
period='14d',
).agg(
pl.col('x').alias('x_list'),
pl.col('y').alias('y_list'),
pl.corr('x', 'y', method='spearman').alias('sp_rank'),
)
shape: (366, 4)
┌────────────┬───────────────────┬───────────────────┬───────────┐
│ time ┆ x_list ┆ y_list ┆ sp_rank │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list[i64] ┆ list[i64] ┆ f64 │
╞════════════╪═══════════════════╪═══════════════════╪═══════════╡
│ 2021-01-01 ┆ [245] ┆ [246] ┆ NaN │
│ 2021-01-02 ┆ [245, 334] ┆ [246, 128] ┆ -1.0 │
│ 2021-01-03 ┆ [245, 334, 74] ┆ [246, 128, 295] ┆ -1.0 │
│ 2021-01-04 ┆ [245, 334, … 52] ┆ [246, 128, … 150] ┆ -0.4 │
│ 2021-01-05 ┆ [245, 334, … 117] ┆ [246, 128, … 350] ┆ -0.3 │
│ … ┆ … ┆ … ┆ … │
│ 2021-12-28 ┆ [243, 285, … 172] ┆ [6, 186, … 16] ┆ -0.006593 │
│ 2021-12-29 ┆ [285, 81, … 124] ┆ [186, 331, … 298] ┆ -0.112088 │
│ 2021-12-30 ┆ [81, 174, … 66] ┆ [331, 2, … 214] ┆ -0.257143 │
│ 2021-12-31 ┆ [174, 221, … 14] ┆ [2, 208, … 82] ┆ -0.059341 │
│ 2022-01-01 ┆ [221, 276, … 261] ┆ [208, 40, … 275] ┆ 0.032967 │
└────────────┴───────────────────┴───────────────────┴───────────┘
I typically only do this for a small subset at first, just to ensure that the results are correct. (For the final run, you won't want to keep the list columns.)
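As a quick sanity check, the first few windows can be recomputed by hand. The sketch below uses only the standard library and the first five x and y values shown in the data above; the helper names ranks and spearman are made up for illustration, and the shortcut formula assumes there are no ties within a window (true for this shuffled data).

```python
def ranks(values):
    # Rank from 1..n; the sample values are distinct, so ties are not handled.
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman(xs, ys):
    # Spearman's rho via the no-ties shortcut: 1 - 6*sum(d^2) / (n*(n^2 - 1)).
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# First five rows of x and y, i.e. the growing windows for 2021-01-02 to 2021-01-05.
x = [245, 334, 74, 52, 117]
y = [246, 128, 295, 150, 350]

for n in range(2, 6):
    print(n, round(spearman(x[:n], y[:n]), 6))
```

The rounded values for window sizes 2 through 5 should line up with the sp_rank column for 2021-01-02 through 2021-01-05. A one-row window has no defined correlation, which is why the first row is NaN.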
map_groups
We can use the map_groups function inside agg. Let's perform an OLS regression using the statsmodels library.
First, let's expand your data so that we can show how to incorporate multiple independent variables.
from datetime import datetime
import polars as pl
import statsmodels.api as sm
import numpy as np
df = pl.DataFrame(
{
"time": pl.date_range(
start=datetime(2021, 1, 1),
end=datetime(2022, 1, 1),
interval="1d",
eager=True
),
"y": pl.int_range(0, 366, eager=True).shuffle(seed=4),
"x1": pl.int_range(0, 366, eager=True).shuffle(seed=1),
"x2": pl.int_range(0, 366, eager=True).shuffle(seed=2),
"x3": pl.int_range(0, 366, eager=True).shuffle(seed=3),
}
)
df
shape: (366, 5)
┌────────────┬─────┬─────┬─────┬─────┐
│ time ┆ y ┆ x1 ┆ x2 ┆ x3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═════╪═════╪═════╪═════╡
│ 2021-01-01 ┆ 347 ┆ 245 ┆ 246 ┆ 90 │
│ 2021-01-02 ┆ 147 ┆ 334 ┆ 128 ┆ 38 │
│ 2021-01-03 ┆ 85 ┆ 74 ┆ 295 ┆ 24 │
│ 2021-01-04 ┆ 36 ┆ 52 ┆ 150 ┆ 297 │
│ 2021-01-05 ┆ 31 ┆ 117 ┆ 350 ┆ 80 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 2021-12-28 ┆ 125 ┆ 172 ┆ 16 ┆ 221 │
│ 2021-12-29 ┆ 268 ┆ 124 ┆ 298 ┆ 246 │
│ 2021-12-30 ┆ 275 ┆ 66 ┆ 214 ┆ 58 │
│ 2021-12-31 ┆ 69 ┆ 14 ┆ 82 ┆ 150 │
│ 2022-01-01 ┆ 233 ┆ 261 ┆ 275 ┆ 181 │
└────────────┴─────┴─────┴─────┴─────┘
Rather than use a lambda function, I prefer to create a separate function where I can keep my calculation choices all in one place.
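def regress_it(s_list: list[pl.Series]) -> float:
    # Column-stack the Series: column 0 is y, the remaining columns are the regressors.
    np_array = np.column_stack(s_list)
    y = np_array[:, 0]
    X = np_array[:, 1:]
    X = sm.add_constant(X)
    # Keep only the first fitted parameter.
    result = sm.OLS(
        endog=y,
        exog=X,
    ).fit().params[0]
    return result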
df.rolling(index_column="time", period="14d").agg(
pl.map_groups(
exprs=["y", "x1", "x2", "x3"],
function=regress_it
)
.alias('coeff')
)
The map_groups function will pass a list of Series to the called function (regress_it in this case).
In the regress_it function, we first take this list of Series and column-stack them into a NumPy array. We can then easily pass slices of the array to the OLS function.
Because of the order in which we listed the columns in the map_groups call, the first column is our dependent variable y (the endog parameter in sm.OLS). The remaining columns (x1, x2, x3) are our independent variables (the exog parameter in sm.OLS).
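I've also added a constant for the regression. sm.add_constant places the constant column first by default, so params[0] is normally the intercept rather than a slope. Here's the output.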
shape: (366, 2)
┌────────────┬─────────────┐
│ time ┆ coeff │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═════════════╡
│ 2021-01-01 ┆ 0.66087 │
│ 2021-01-02 ┆ 0.002035 │
│ 2021-01-03 ┆ -0.013203 │
│ 2021-01-04 ┆ -798.950143 │
│ 2021-01-05 ┆ -239.766471 │
│ … ┆ … │
│ 2021-12-28 ┆ 284.286344 │
│ 2021-12-29 ┆ 275.147184 │
│ 2021-12-30 ┆ 285.589173 │
│ 2021-12-31 ┆ 232.471054 │
│ 2022-01-01 ┆ 231.976641 │
└────────────┴─────────────┘
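To see what happens to the list that map_groups hands over, here is a minimal NumPy-only sketch of the same stack-and-slice pattern. It swaps sm.OLS for np.linalg.lstsq purely so that it runs without statsmodels; the function name intercept_via_lstsq and the toy data are assumptions for illustration, not part of the code above.

```python
import numpy as np


def intercept_via_lstsq(columns):
    # columns[0] is the dependent variable; the remaining entries are regressors.
    stacked = np.column_stack(columns)
    y = stacked[:, 0]
    X = stacked[:, 1:]
    # Prepend a column of ones, mirroring where sm.add_constant puts it by default.
    X = np.column_stack([np.ones(len(y)), X])
    # Ordinary least squares fit; params[0] belongs to the column of ones.
    params, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(params[0])


# Toy data built from a known relationship, so the intercept is known in advance.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 10.0 + 2.0 * x1 - 3.0 * x2

print(intercept_via_lstsq([y, x1, x2]))
```

Since y is built with an intercept of 10, the printed value should be very close to 10. Changing the index into params would return one of the slopes instead.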