Search code examples
pythonpandasstatsmodels

rolling regression with a simple apply in pandas


Consider this simple example

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9,1,3,5,7,4,5,6,4,7,8,9],
                   'b':[3,5,6,2,4,6,2,5,7,1,9,5,3,2,5,4,3,6,4,1,1,9]})

I am trying to perform a rolling regression of a on b. I am trying to use the simplest pandas tool available: apply. I want to use apply because I want to keep the flexibility of returning any parameter of the regression.

However, the simple code below does not work

df.rolling(10).apply(lambda x: smf.ols('a ~ b', data = x).fit())

  File "<string>", line 1, in <module>

PatsyError: Error evaluating factor: NameError: name 'b' is not defined
    a ~ b
    ^

What is the issue? Thanks!


Solution

  • rolling apply is not capable of interacting with multiple columns simultaneously, nor is it able to produce non-numeric values. We instead need to take advantage of the iterable nature of rolling objects. We also need to account for handling min_periods ourselves, since the iterable rolling object generates all windows results regardless of other rolling arguments.

    We can then create some function to produce each row in the results from the regression results to do something like:

    def process(x):
        if len(x) >= 10:
            reg = smf.ols('a ~ b', data=x).fit()
            print(reg.params)
            return [
                # b from params
                reg.params['b'],
                # b from tvalues
                reg.tvalues['b'],
                # Both lower and upper b from conf_int()
                *reg.conf_int().loc['b', :].tolist()
            ]
        # Return NaN in the same dimension as the results
        return [np.nan] * 4
    
    
    df = df.join(
        # join new DataFrame back to original
        pd.DataFrame(
            (process(x) for x in df.rolling(10)),
            columns=['coef', 't', 'lower', 'upper']
        )
    )
    

    df:

        a  b      coef         t     lower     upper
    0   1  3       NaN       NaN       NaN       NaN
    1   3  5       NaN       NaN       NaN       NaN
    2   5  6       NaN       NaN       NaN       NaN
    3   7  2       NaN       NaN       NaN       NaN
    4   4  4       NaN       NaN       NaN       NaN
    5   5  6       NaN       NaN       NaN       NaN
    6   6  2       NaN       NaN       NaN       NaN
    7   4  5       NaN       NaN       NaN       NaN
    8   7  7       NaN       NaN       NaN       NaN
    9   8  1 -0.216802 -0.602168 -1.047047  0.613442
    10  9  9  0.042781  0.156592 -0.587217  0.672778
    11  1  5  0.032086  0.097763 -0.724742  0.788913
    12  3  3  0.113475  0.329006 -0.681872  0.908822
    13  5  2  0.198582  0.600297 -0.564258  0.961421
    14  7  5  0.203540  0.611002 -0.564646  0.971726
    15  4  4  0.236599  0.686744 -0.557872  1.031069
    16  5  3  0.293651  0.835945 -0.516403  1.103704
    17  6  6  0.314286  0.936382 -0.459698  1.088269
    18  4  4  0.276316  0.760812 -0.561191  1.113823
    19  7  1  0.346491  1.028220 -0.430590  1.123572
    20  8  1 -0.492424 -1.234601 -1.412181  0.427332
    21  9  9  0.235075  0.879433 -0.381326  0.851476
    

    Setup:

    import pandas as pd
    import numpy as np
    import statsmodels.formula.api as smf
    
    df = pd.DataFrame({
        'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9, 1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
        'b': [3, 5, 6, 2, 4, 6, 2, 5, 7, 1, 9, 5, 3, 2, 5, 4, 3, 6, 4, 1, 1, 9]
    })