I was trying to implement rolling autocorrelation in polars, but got some weird results when there are nulls involved.
The code is pretty simple. Let's say I have two dataframes, df1 and df2:
import polars as pl
df1 = pl.DataFrame({'a': [1.06, 1.07, 0.93, 0.78, 0.85], 'lag_a': [1., 1.06, 1.07, 0.93, 0.78]})
df2 = pl.DataFrame({'a': [1., 1.06, 1.07, 0.93, 0.78, 0.85], 'lag_a': [None, 1., 1.06, 1.07, 0.93, 0.78]})
You can see that the only difference is that in df2, the first row of lag_a is None, because it's shifted from a.
When I compute rolling_corr for both dataframes, however, I get different results.
# df1.select(pl.rolling_corr('a', 'lag_a', window_size=10, min_periods=5, ddof=1))
shape: (5, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ null     │
│ null     │
│ null     │
│ null     │
│ 0.622047 │
└──────────┘
# df2.select(pl.rolling_corr('a', 'lag_a', window_size=10, min_periods=5, ddof=1))
shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ f64       │
╞═══════════╡
│ null      │
│ null      │
│ null      │
│ null      │
│ null      │
│ -0.219851 │
└───────────┘
The result from df1, i.e. 0.622047, is what I get from numpy.corrcoef as well. I wonder where the -0.219851 is coming from.
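As a sanity check that 0.622047 is the expected value, here is a minimal dependency-free sketch of the textbook Pearson correlation on the five complete pairs. The helper name pearson is made up for this illustration and is not part of polars or numpy.

```python
import math

# The five complete (a, lag_a) pairs from df1.
a = [1.06, 1.07, 0.93, 0.78, 0.85]
lag_a = [1.0, 1.06, 1.07, 0.93, 0.78]


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences (hypothetical helper)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    # The ddof factor cancels between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)


r = pearson(a, lag_a)
print(round(r, 6))  # approximately 0.622047, matching the df1 result
```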
I think this is a bug in the Rust implementation of rolling_corr (in fairness, it is marked unstable in Python). It looks like it naively applies rolling_mean without first applying the joint null mask. So the rolling mean of a that's used in the computation is:
df2.get_column("a").rolling_mean(window_size=10, min_periods=5)
shape: (6,)
Series: 'a' [f64]
[
null
null
null
null
0.968
0.948333
]
That's the correct rolling mean in a vacuum, but in this case the first row of df2's column a should be treated as null because lag_a is null there, so df2's rolling mean should be the same as df1's rolling mean, with an extra null up front:
df1.get_column("a").rolling_mean(window_size=10, min_periods=5)
shape: (5,)
Series: 'a' [f64]
[
null
null
null
null
0.938
]
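To see how mismatched statistics could produce -0.219851, here is a plain-Python sketch of the last window of df2. It assumes the correlation is assembled as (mean(a*lag_a) - mean(a)*mean(lag_a)) * n/(n - ddof) divided by sqrt(var(a) * var(lag_a)), with each statistic skipping only its own nulls. That formula is my guess at the arithmetic, not something taken from the polars source, but it does reproduce the observed number.

```python
import math

a = [1.0, 1.06, 1.07, 0.93, 0.78, 0.85]
lag_a = [None, 1.0, 1.06, 1.07, 0.93, 0.78]
ddof = 1


def mean(values):
    return sum(values) / len(values)


def var(values, ddof):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)


# Assumption: each statistic drops only its own nulls, so a keeps all
# six values while lag_a and the product a*lag_a keep five.
x = a
y = [v for v in lag_a if v is not None]
xy = [p * q for p, q in zip(a, lag_a) if q is not None]
n = len(xy)  # number of complete pairs in the window

cov = (mean(xy) - mean(x) * mean(y)) * n / (n - ddof)
r_mixed = cov / math.sqrt(var(x, ddof) * var(y, ddof))
print(round(r_mixed, 6))  # approximately -0.219851
```

The mean of a here is 0.948333 (six values) rather than 0.938 (the five rows where lag_a is also present), which is exactly the mismatch described above.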
I'd suggest filing a bug report or even a PR. It doesn't look like a hard fix: it should just require precomputing the joint null mask and applying it to all expressions before calculating rolling statistics on them.
In the meantime, you can apply the mask yourself before computing the correlation:
df2.with_columns(
    pl.when(pl.any_horizontal(pl.all().is_null()))
    .then(None)
    .otherwise(pl.all())
    .name.keep()
).select(pl.rolling_corr("a", "lag_a", window_size=10, min_periods=5))
shape: (6, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ null     │
│ null     │
│ null     │
│ null     │
│ null     │
│ 0.622047 │
└──────────┘
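If you want an independent reference to check against, the mask-first approach suggested above can be sketched in plain Python. This is only an illustration of the logic (drop any row where either value is null, then compute every statistic on what remains). The function name rolling_corr_masked is hypothetical, and this is not how polars implements it.

```python
import math


def rolling_corr_masked(x, y, window_size, min_periods):
    """Rolling Pearson correlation using only rows where both x and y are present."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - window_size + 1)
        # Joint null mask: keep a pair only if both sides are non-null.
        pairs = [
            (p, q)
            for p, q in zip(x[lo : i + 1], y[lo : i + 1])
            if p is not None and q is not None
        ]
        n = len(pairs)
        if n < max(min_periods, 2):
            out.append(None)
            continue
        mx = sum(p for p, _ in pairs) / n
        my = sum(q for _, q in pairs) / n
        sxy = sum((p - mx) * (q - my) for p, q in pairs)
        sxx = sum((p - mx) ** 2 for p, _ in pairs)
        syy = sum((q - my) ** 2 for _, q in pairs)
        denom = math.sqrt(sxx * syy)
        # A constant column has zero variance, so the correlation is undefined.
        out.append(sxy / denom if denom > 0 else None)
    return out


a = [1.0, 1.06, 1.07, 0.93, 0.78, 0.85]
lag_a = [None, 1.0, 1.06, 1.07, 0.93, 0.78]
result = rolling_corr_masked(a, lag_a, window_size=10, min_periods=5)
print(result)  # five None values, then approximately 0.622047
```

Because every statistic is computed over the same set of rows, the ddof factor cancels and the last value agrees with the masked polars result.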