
Polars Rolling Corr giving weird results


I was trying to implement a rolling autocorrelation in Polars, but got unexpected results when nulls are involved.

The code is pretty simple. Let's say I have two dataframes df1 and df2:

import polars as pl

df1 = pl.DataFrame({'a': [1.06, 1.07, 0.93, 0.78, 0.85], 'lag_a': [1., 1.06, 1.07, 0.93, 0.78]})

df2 = pl.DataFrame({'a': [1., 1.06, 1.07, 0.93, 0.78, 0.85], 'lag_a': [None, 1., 1.06, 1.07, 0.93, 0.78]})

You can see that the only difference is that in df2, the first row for lag_a is None, because it's shifted from a.

When I compute rolling_corr on both dataframes, however, I get different results.

# df1.select(pl.rolling_corr('a', 'lag_a', window_size=10, min_periods=5, ddof=1))

shape: (5, 1)
┌──────────┐
│ a        │
│ ---      │
│ f64      │
╞══════════╡
│ null     │
│ null     │
│ null     │
│ null     │
│ 0.622047 │
└──────────┘
# df2.select(pl.rolling_corr('a', 'lag_a', window_size=10, min_periods=5, ddof=1))
shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ f64       │
╞═══════════╡
│ null      │
│ null      │
│ null      │
│ null      │
│ null      │
│ -0.219851 │
└───────────┘

The result from df1, 0.622047, matches what I get from numpy.corrcoef. Where does the -0.219851 come from?
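
For reference, the numpy.corrcoef check mentioned above can be sketched as follows. It uses only the five complete rows from df1; the variable names are chosen here for illustration.

```python
import numpy as np

# The five complete (a, lag_a) pairs from df1.
a = np.array([1.06, 1.07, 0.93, 0.78, 0.85])
lag_a = np.array([1.0, 1.06, 1.07, 0.93, 0.78])

# corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two series.
r = np.corrcoef(a, lag_a)[0, 1]
print(round(r, 6))
```

This should agree with the 0.622047 that rolling_corr reports for the last row of df1.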


Solution

  • I think this is a bug in the Rust implementation of rolling_corr (in fairness, it is marked unstable in Python). It looks like it naively applies rolling_mean to each column without first applying the joint null mask. So the rolling mean of a that's used in the computation is:

    df2.get_column("a").rolling_mean(window_size=10, min_periods=5)
    
    shape: (6,)
    Series: 'a' [f64]
    [
        null
        null
        null
        null
        0.968
        0.948333
    ]
    

    That's the correct rolling mean in a vacuum, but in this case the first row of df2’s column a should be considered null because lag_a is null there, and so df2’s rolling mean should be the same as df1’s rolling mean, with an extra null up front.

    df1.get_column("a").rolling_mean(window_size=10, min_periods=5)
    
    shape: (5,)
    Series: 'a' [f64]
    [
        null
        null
        null
        null
        0.938
    ]
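
To make the hypothesis concrete, here is a rough NumPy reconstruction of what a correlation looks like when every statistic skips only its own nulls. The formula below is an assumption for illustration (covariance from the mean of products, rescaled by n / (n - ddof), divided by the product of per-column standard deviations); it is not copied from the Polars source. NaN stands in for the null.

```python
import numpy as np

# df2's columns; NaN plays the role of the null in lag_a.
a = np.array([1.0, 1.06, 1.07, 0.93, 0.78, 0.85])
lag_a = np.array([np.nan, 1.0, 1.06, 1.07, 0.93, 0.78])
ddof = 1

# Each statistic ignores only its own missing values (no joint mask),
# so the statistics for `a` use 6 rows while the others use 5.
mean_a = a.mean()
mean_lag = np.nanmean(lag_a)
mean_prod = np.nanmean(a * lag_a)
var_a = a.var(ddof=ddof)
var_lag = np.nanvar(lag_a, ddof=ddof)

# Number of rows where both values are present.
n_pairs = np.count_nonzero(~np.isnan(a * lag_a))

# Assumed formula: covariance from mismatched means, then normalise.
cov = (mean_prod - mean_a * mean_lag) * n_pairs / (n_pairs - ddof)
mismatched_corr = cov / np.sqrt(var_a * var_lag)
print(mismatched_corr)
```

On this data the mismatched version comes out at roughly -0.22, in line with the unexpected value in the question, which supports the idea that the statistics for a are being computed over a row that should have been masked out.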
    

    I'd suggest filing a bug report or even a PR. It doesn't look like a hard fix: it should just require precomputing the joint null mask and applying it to every input expression before computing the rolling statistics on them.
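
As a sketch of the behaviour such a fix would produce, here is a small pure-NumPy reference that applies the joint mask inside each window before correlating. The helper name rolling_corr_pairwise is hypothetical, NaN stands in for null, and the loop is written for clarity rather than speed.

```python
import numpy as np

def rolling_corr_pairwise(x, y, window_size, min_periods):
    """Rolling Pearson correlation using only rows where both inputs are present."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Joint mask: a row counts only if neither value is missing.
    valid = ~(np.isnan(x) | np.isnan(y))
    out = np.full(len(x), np.nan)
    for i in range(len(x)):
        lo = max(0, i - window_size + 1)
        m = valid[lo:i + 1]
        # Leave the result missing until enough complete pairs are available.
        if m.sum() < min_periods:
            continue
        out[i] = np.corrcoef(x[lo:i + 1][m], y[lo:i + 1][m])[0, 1]
    return out

a = [1.0, 1.06, 1.07, 0.93, 0.78, 0.85]
lag_a = [np.nan, 1.0, 1.06, 1.07, 0.93, 0.78]
result = rolling_corr_pairwise(a, lag_a, window_size=10, min_periods=5)
print(result)
```

With the joint mask in place, the first five entries stay missing and the last one matches the df1 result of about 0.622047.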

    In the meantime, you can apply the mask yourself before computing the correlation:

    df2.with_columns(
        pl.when(pl.any_horizontal(pl.all().is_null()))
        .then(None)
        .otherwise(pl.all())
        .name.keep()
    ).select(pl.rolling_corr("a", "lag_a", window_size=10, min_periods=5))
    
    shape: (6, 1)
    ┌──────────┐
    │ a        │
    │ ---      │
    │ f64      │
    ╞══════════╡
    │ null     │
    │ null     │
    │ null     │
    │ null     │
    │ null     │
    │ 0.622047 │
    └──────────┘