I am trying to implement a rolling regression in Python and failed to do so using statsmodels' 'RollingOLS'. In my data frame, the 'Year' column specifies the year of the respective observation.
Now I want to regress 'F1_Earnings' on 'Earnings' and 'WC' with a rolling 2-year window, such that the forecast made in 1998 for 1999 is based on the two preceding years, 1997 and 1998. However, I do not get a meaningful result, probably because I haven't understood how to set the window parameter properly. So how do I relate the window parameter to the 'Year' variable?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import seaborn
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS
d1 = {'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6], 'Earnings': [100, 200,
400, 250, 300, 350, 400, 550, 700, 259, 300, 350, 270, 450, 340, 570, 340, 340], 'WC':
[20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52, 23, 54, 45, 87, 54, 65], 'Year': [1995,
1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998, 1995, 1997, 1998, 1996,
1997, 1998], 'F1_Earnings': [120, 220, 420, 280, 530, 670, 780, 210, 950, 100, 120, 430,
780, 210, 950, 100, 120, 430]}
df1 = pd.DataFrame(data=d1)
y = df1['F1_Earnings']
features = ["Earnings", "WC"]
x = df1[features]
rols = RollingOLS(y, x, window=2)
rres = rols.fit()
params = rres.params.copy()
params.index = np.arange(1, params.shape[0] + 1)
params.head()
I think one issue you are running into is the constraint on window, which the documentation describes as: "window (int): Length of the rolling window. Must be strictly larger than the number of variables in the model." With two regressors, window=2 does not meet that requirement.
Also, window is just a count of observations, so window=2 will only use two consecutive rows of the data frame, not two years. That isn't going to work, since you have a variable number of observations for each year.
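To see why a row-count window cannot stand in for a two-year window here, it may help to count how many rows each year contributes. This is only a quick sketch using the 'Year' values from the question; the name rows_per_year is made up for illustration.

```python
import pandas as pd

# 'Year' column copied from the question's data
years = [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998,
         1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998]
df = pd.DataFrame({"Year": years})

# Number of observations per calendar year
rows_per_year = df.groupby("Year").size()
print(rows_per_year)

# The rows are ordered by ID, not by year, so neighbouring rows mix years
print(df["Year"].is_monotonic_increasing)
```

The counts come out as 3, 4, 6 and 5 rows for 1995 through 1998, and the column is not sorted by year, so no single fixed row count lines up with "the last two years".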
Also, you need to add an intercept manually (as a constant column) if you want both slopes and an intercept.
Given your variable number of records per year, I think a for loop is the only real option.
import pandas as pd
from statsmodels.regression.linear_model import OLS
d1 = {
    'ID': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6],
    'Earnings': [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350, 270, 450, 340, 570, 340, 340],
    'WC': [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52, 23, 54, 45, 87, 54, 65],
    'Year': [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998],
    'F1_Earnings': [120, 220, 420, 280, 530, 670, 780, 210, 950, 100, 120, 430, 780, 210, 950, 100, 120, 430],
}
df1 = pd.DataFrame(data=d1)
# df1 = add_constant(df1)
df1['intercept'] = 1
outcome = ['F1_Earnings']
features = ["Earnings", "WC", 'intercept']
result = {}
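for year in df1['Year'].unique():
    # Keep only the rows from this year and the year before it (2-year window)
    current_df = df1[(df1["Year"] <= year) & (df1["Year"] >= (year - 1))]
    model = OLS(current_df[outcome], current_df[features]).fit()
    # Store the coefficients under the year they are meant to forecast
    result[year + 1] = model.params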
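Once the loop has run, the result dictionary can be turned into a table of coefficients and used for the forecast. The following is a self-contained sketch that repeats the loop above; the names coefs, window_df and forecast_1999 are made up here, and applying the 1999 coefficients to the 1998 rows is my reading of what "forecast made in 1998 for 1999" means.

```python
import pandas as pd
from statsmodels.regression.linear_model import OLS

df1 = pd.DataFrame({
    "Earnings": [100, 200, 400, 250, 300, 350, 400, 550, 700, 259, 300, 350, 270, 450, 340, 570, 340, 340],
    "WC": [20, 40, 35, 55, 60, 65, 30, 28, 32, 45, 60, 52, 23, 54, 45, 87, 54, 65],
    "Year": [1995, 1996, 1997, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998, 1995, 1997, 1998, 1996, 1997, 1998],
    "F1_Earnings": [120, 220, 420, 280, 530, 670, 780, 210, 950, 100, 120, 430, 780, 210, 950, 100, 120, 430],
})
df1["intercept"] = 1
features = ["Earnings", "WC", "intercept"]

# Same year-based loop as above: fit on the current and the previous year
result = {}
for year in df1["Year"].unique():
    window_df = df1[(df1["Year"] <= year) & (df1["Year"] >= year - 1)]
    result[int(year) + 1] = OLS(window_df["F1_Earnings"], window_df[features]).fit().params

# One row of coefficients per forecast year
coefs = pd.DataFrame(result).T.sort_index()
coefs.index.name = "forecast_year"
print(coefs)

# Apply the coefficients estimated on 1997-1998 to the 1998 observations
latest = df1[df1["Year"] == 1998]
forecast_1999 = latest[features] @ coefs.loc[1999]
print(forecast_1999)
```

Note that the first window (1995 only) has three rows and three parameters, so that fit is exactly determined and its coefficients should not be read as a meaningful estimate.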