Rolling Regression in Python

I am trying to implement a rolling regression in Python and have not managed to get it working with statsmodels' 'RollingOLS'. In my data frame, the 'Year' column specifies the year of each observation.

I want to regress 'F1_Earnings' on 'Earnings' and 'WC' with a rolling 2-year window, so that the forecast made in 1998 for 1999 is based on the two preceding years, 1997 and 1998. I do not get a meaningful result, probably because I have not understood how to set the window parameter properly. How do I relate the window parameter to the 'Year' variable?

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import seaborn
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

d1 = {
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
    'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350, 270, 450, 340, 570, 340, 340],
    'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52, 23, 54, 45, 87, 54, 65],
    'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998],
    'F1_Earnings': [120, 220, 420, 280, 530, 670, 780, 210, 950, 100, 120, 430, 780, 210, 950, 100, 120, 430],
}

df1 = pd.DataFrame(data=d1)

y = df1['F1_Earnings']
features = ["Earnings", "WC"]
x = df1[features]

rols = RollingOLS(y, x, window=2)
rres = rols.fit()
params = rres.params.copy()
params.index = np.arange(1, params.shape[0] + 1)
params.head()

Solution

I think one issue you are running into is the constraint on the window parameter. From the documentation: "window (int): Length of the rolling window. Must be strictly larger than the number of variables in the model." With two regressors, window=2 does not satisfy that.

Also, the window is just a count of observations, so window=2 uses two consecutive rows of the data frame, whatever years they belong to. That cannot represent a 2-year window here, because the number of observations differs from year to year.
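To make the point about uneven years concrete, here is a minimal sketch that counts the rows per year, using the 'Year' column from the question. The names df and rows_per_year are just illustrative.

```python
import pandas as pd

# 'Year' column copied from the question's data
years = [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998,
         1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]
df = pd.DataFrame({"Year": years})

# Number of observations in each year
rows_per_year = df.groupby("Year").size()
print(rows_per_year)
```

Since the counts differ by year, no single row-count window lines up with "the last two years" for every forecast date.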

You also need to add an intercept manually (as a constant column) if you want slope + intercept, because the model does not add one for you.

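Given your variable number of records per year, I think a for loop over the years is the most straightforward option: for each year, select the rows from that year and the year before, fit an ordinary OLS on them, and store the coefficients under the year being forecast.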


import pandas as pd
from statsmodels.regression.linear_model import OLS

d1 = {
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
    'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350, 270, 450, 340, 570, 340, 340],
    'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52, 23, 54, 45, 87, 54, 65],
    'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998],
    'F1_Earnings': [120, 220, 420, 280, 530, 670, 780, 210, 950, 100, 120, 430, 780, 210, 950, 100, 120, 430],
}

df1 = pd.DataFrame(data=d1)
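# OLS does not add an intercept on its own, so add a constant column by hand
df1['intercept'] = 1

outcome = ['F1_Earnings']
features = ["Earnings", "WC", 'intercept']

# One regression per year, fitted on the rows from that year and the year before.
# Note: early windows can be very small (only 1995 rows exist for the first one).
result = {}
for year in sorted(df1['Year'].unique()):
    current_df = df1[(df1["Year"] <= year) & (df1["Year"] >= (year - 1))]
    model = OLS(current_df[outcome], current_df[features]).fit()
    # Key the coefficients by the year they are used to forecast
    result[year + 1] = model.params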