
Efficient expanding OLS in pandas


I would like to explore efficient ways of performing expanding OLS in pandas (or other libraries that work well with DataFrames/Series).

  1. Assuming the dataset is large, I am NOT interested in any solutions with a for-loop;
  2. I am looking for solutions about expanding rather than rolling. Rolling functions always require a fixed window, while expanding uses a variable window (starting from the beginning);
  3. Please do not suggest pandas.stats.ols.MovingOLS because it is deprecated;
  4. Please do not suggest other deprecated methods such as expanding_mean.
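To make the rolling/expanding distinction concrete, here is a toy example (illustrative data only, not the dataset from the question):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling: fixed window of 2 observations
r = s.rolling(2).mean()    # NaN, 1.5, 2.5, 3.5
# Expanding: window grows from the first observation onward
e = s.expanding().mean()   # 1.0, 1.5, 2.0, 2.5
```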

For example, there is a DataFrame df with two columns X and y. To make it simpler, let's just calculate beta. Currently, I am thinking about something like

import numpy as np
import pandas as pd
import statsmodels.api as sm

def my_OLS_func(df, y_name, X_name):
  y = df[y_name]
  X = df[X_name]
  X = sm.add_constant(X)
  b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
  return b

df = pd.DataFrame({'X':[1,2.5,3], 'y':[4,5,6.3]})

df['beta'] = df.expanding().apply(my_OLS_func, args = ('y', 'X'))

Expected values of df['beta'] are 0 (or NaN), 0.66666667, and 1.038462.

However, this does not work: expanding().apply() passes each column's window to the function separately as a 1-D array, so there is no obvious way to hand the two Series to the function together. Any suggestions would be appreciated.


Solution

  • One option is to use the RecursiveLS (recursive least squares) model from Statsmodels:

    import numpy as np
    import statsmodels.api as sm

    # Simulate some data
    rs = np.random.RandomState(seed=12345)
    
    nobs = 100000
    beta = [10., -0.2]
    sigma2 = 2.5
    
    exog = sm.add_constant(rs.uniform(size=nobs))
    eps = rs.normal(scale=sigma2**0.5, size=nobs)
    endog = np.dot(exog, beta) + eps
    
    # Construct and fit the recursive least squares model
    mod = sm.RecursiveLS(endog, exog)
    res = mod.fit()
    # This is a 2 x 100,000 numpy array with the regression coefficients
    # that would be estimated when using data from the beginning of the
    # sample to each point. You should usually ignore the first k=2
    # datapoints since they are controlled by a diffuse prior.
    res.recursive_coefficients.filtered
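
  • Alternatively, for a single regressor the expanding OLS slope has a closed form that can be fully vectorized with cumulative sums, so no loop is needed. A sketch of this idea (the helper name expanding_beta is mine, not a library function):

```python
import numpy as np
import pandas as pd

def expanding_beta(df, y_name, X_name):
    # Closed-form expanding OLS slope, using all data up to each row t:
    # beta_t = (n*S_xy - S_x*S_y) / (n*S_xx - S_x**2)
    x = df[X_name].to_numpy()
    y = df[y_name].to_numpy()
    n = np.arange(1, len(df) + 1)
    S_x, S_y = np.cumsum(x), np.cumsum(y)
    S_xx, S_xy = np.cumsum(x * x), np.cumsum(x * y)
    denom = n * S_xx - S_x**2
    with np.errstate(divide='ignore', invalid='ignore'):
        beta = (n * S_xy - S_x * S_y) / denom  # NaN where slope is undefined
    return pd.Series(beta, index=df.index)

df = pd.DataFrame({'X': [1, 2.5, 3], 'y': [4, 5, 6.3]})
df['beta'] = expanding_beta(df, 'y', 'X')
# df['beta'] is NaN, 0.666667, 1.038462 -- matching the expected values
```

One caveat: plain cumulative sums can lose floating-point precision on very long series, so for large datasets the RecursiveLS approach above may be numerically safer.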