python indexing time-series regression predict

Trouble doing time series analysis in Python

I am hoping to do an event study analysis, but I cannot seem to properly build a simple predictive model with time as the independent variable. I've been using this as a guide.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

#sample data
units = [0.916301354, 0.483947819, 0.551258976, 0.147971439, 0.617461504, 0.957460424, 0.905076453, 0.274261518, 0.861609383, 0.285914819, 0.989686616, 0.86614591, 0.074250832, 0.209507105, 0.082518752, 0.215795111, 0.953852132, 0.768329343, 0.380686392, 0.623940323, 0.155944248, 0.495745862, 0.0845513, 0.519966471, 0.706618333, 0.872300766, 0.70769554, 0.760616731, 0.213847926, 0.703866155, 0.802862491, 0.52468101, 0.352283626, 0.128962646, 0.684358794, 0.360520106, 0.889978575, 0.035806225, 0.15459103, 0.227742501, 0.06248614, 0.903500165, 0.13851151, 0.664684486, 0.011042697, 0.86353796, 0.971852899, 0.487774978, 0.547767217, 0.153629408, 0.076994094, 0.230693561, 0.961345948]
begin_date = '2022-8-01'
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})


# Create estimation data set
est_data = df['2022-08-01':'2022-08-30']

# And observation data
obs_data = df['2022-09-01':'2022-09-14']

# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()

# Get AR
# Using mean of estimation return
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return

# Then using model fit with estimation data
obs_data['risk_pred'] = m.predict()

obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']

# Graph the results
sns.lineplot(x = obs_data['date'],y = obs_data['AR_risk'])
plt.show()

As is, it won't recognise the date as a variable (image attached)

I've tried leaving the index as a counter and making the date a separate variable, but then, when it gets to the "predict" portion, it doesn't know how to predict on dates that it has not seen before.

Solution

There are quite a lot of bugs in your code. I'll explain them one by one below (see the comments between ''' '''):

'''
Small note: here the column is named "units", but below you refer to a column called "variable".
Not a big problem (you were most probably reading the data from a file anyway), just something to keep in mind.
'''
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})
'''
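The following two lines do not work like that.
First, the data frame is not indexed by a datetime: it has the default integer index, so date strings match no row labels.
Second, slicing rows by date label needs a DatetimeIndex; once you have one, prefer .loc for label-based selection (.iloc selects by integer position, not by label).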
'''
# Create estimation data set
est_data = df['2022-08-01':'2022-08-30'] 
# And observation data
obs_data = df['2022-09-01':'2022-09-14']
'''
Here you are fitting the model on est_data.
Calling m.predict() with no arguments gives you the fitted values for est_data.
This will be important later.
'''
# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()
# Get AR
# Using mean of estimation return
'''
You don't need np.mean for this, just use est_data['variable'].mean().
Also, you most probably do not need to store the mean in a separate variable.
You can subtract it directly using obs_data['variable'] - est_data['variable'].mean()
'''
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return
'''
This will not always work, and in this case it does not.
m.predict() with no arguments returns the fitted values for est_data, so it outputs as many points as est_data has rows.
The assignment only succeeds if obs_data has the same number of rows as est_data, and even then the values are the est_data fit, not predictions for the obs_data dates.
'''
obs_data['risk_pred'] = m.predict()
obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']
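
To make the last comment concrete, here is a minimal sketch of the difference between calling predict() with and without new data. The toy frames est and obs and the numeric column t are assumptions made up for illustration, not your data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: 10 estimation rows and 4 observation rows
rng = np.random.default_rng(0)
est = pd.DataFrame({"t": np.arange(10), "variable": rng.random(10)})
obs = pd.DataFrame({"t": np.arange(10, 14), "variable": rng.random(4)})

m = smf.ols("variable ~ t", data=est).fit()

in_sample = m.predict()          # fitted values for the 10 estimation rows
out_of_sample = m.predict(obs)   # predictions for the 4 new rows in obs

print(len(in_sample), len(out_of_sample))  # 10 4
```

Assigning in_sample to a column of obs would fail here because the lengths differ, which is the same problem as in the line above.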

I am currently working on fixing the bugs and will give you a working example soon. In the meantime, could you please answer the following question:

Do you really want to fit the model according to est_data? If so, how are you gonna combine this with obs_data?

Edit 1: how to separate the data

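The following code selects rows by comparing the "date" column against timestamps:

# assumes the value column is named "variable" (rename "units" if you use the sample data)
est_data_Start = pd.to_datetime('2022-08-01')
est_data_End = pd.to_datetime('2022-08-30')
obs_data_Start = pd.to_datetime('2022-09-01')
est_data = df[df["date"].between(est_data_Start, est_data_End)]
# >= keeps the start date itself; .copy() lets us add columns later without a SettingWithCopyWarning
obs_data = df[df["date"] >= obs_data_Start].copy()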

The results for est_data are:

    date    variable
0   2022-08-01  0.916301
1   2022-08-02  0.483948
2   2022-08-03  0.551259
3   2022-08-04  0.147971
4   2022-08-05  0.617462
5   2022-08-06  0.957460
6   2022-08-07  0.905076
7   2022-08-08  0.274262
8   2022-08-09  0.861609
9   2022-08-10  0.285915
10  2022-08-11  0.989687
11  2022-08-12  0.866146
12  2022-08-13  0.074251
13  2022-08-14  0.209507
14  2022-08-15  0.082519
15  2022-08-16  0.215795
16  2022-08-17  0.953852
17  2022-08-18  0.768329
18  2022-08-19  0.380686
19  2022-08-20  0.623940
20  2022-08-21  0.155944
21  2022-08-22  0.495746
22  2022-08-23  0.084551
23  2022-08-24  0.519966
24  2022-08-25  0.706618
25  2022-08-26  0.872301
26  2022-08-27  0.707696
27  2022-08-28  0.760617
28  2022-08-29  0.213848
29  2022-08-30  0.703866

The rows selected by the obs_data filter go to obs_data; note that 2022-08-31 lies between the two windows and belongs to neither.
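
If you would rather keep the string slices from your original code, one option is to index the frame by date first. This is only a sketch: the frame below is a hypothetical stand-in for your data (53 daily rows with a "variable" column), and by_date is a name I made up:

```python
import pandas as pd

# Hypothetical stand-in for the question's data: 53 daily values
df = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": range(53),
})

# Index by date so that date strings can be used as row labels
by_date = df.set_index("date")

# .loc slices by label and includes both endpoints
est_data = by_date.loc["2022-08-01":"2022-08-30"]
obs_data = by_date.loc["2022-09-01":"2022-09-14"].copy()

print(len(est_data), len(obs_data))  # 30 14
```

With this layout the date lives in the index rather than in a column, so any later code that uses est_data["date"] would need est_data.index instead.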

Edit 2: fitting and predicting

The following code uses OLS to fit a model to est_data, with the integer row index as the time predictor. Then the model is used to predict values for the rows of obs_data:

XTrain = est_data.index # get the training predictor (the integer row index)
XTrain = sm.add_constant(XTrain) # add a constant column so the model has an intercept
m = sm.OLS(est_data["variable"], XTrain).fit() # fit on the estimation data
XTest = obs_data.index # get the testing predictor
XTest = sm.add_constant(XTest) # and add a constant column
obs_data = obs_data.copy() # work on a copy to avoid a SettingWithCopyWarning when adding columns
obs_data["risk_pred"] = m.predict(XTest) # predict for the new rows
# the following two calculations I just copied from you...
obs_data["AR_mean"] = obs_data["variable"] - est_data["variable"].mean()
obs_data["AR_risk"] = obs_data["variable"] - obs_data["risk_pred"]
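
Note that the row index only works as a time variable because the data is daily with no gaps. As a hedged alternative, the sketch below converts each date to the number of days since the first date and regresses on that, which also lets you keep the formula interface. The random data, the column name t and the variable origin are assumptions for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical daily data standing in for the question's df
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": rng.random(53),
})

# Numeric time predictor: whole days elapsed since the first date
origin = df["date"].min()
df["t"] = (df["date"] - origin).dt.days

est_data = df[df["date"].between(pd.Timestamp("2022-08-01"),
                                 pd.Timestamp("2022-08-30"))]
obs_data = df[df["date"] >= pd.Timestamp("2022-09-01")].copy()

# Fit on the estimation window, then predict for unseen dates through t
m = smf.ols("variable ~ t", data=est_data).fit()
obs_data["risk_pred"] = m.predict(obs_data)
obs_data["AR_risk"] = obs_data["variable"] - obs_data["risk_pred"]

print(obs_data[["date", "risk_pred", "AR_risk"]].head())
```

Because the model only ever sees the number t, it can predict for any date, including ones outside the estimation window, as long as t is computed from the same origin.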

The following code plots the results:

plt.figure()
plt.plot(est_data["date"], est_data["variable"], "-o", label = "Estimation")
plt.plot(obs_data["date"], obs_data["variable"], "-o", label = "Observed")
plt.plot(est_data["date"], m.predict(XTrain), label = "Train fit")
plt.plot(obs_data["date"], m.predict(XTest), label = "Test fit")
plt.legend(ncols =4, bbox_to_anchor=[0.5, 1.1, 0.5, 0])
plt.grid()
locator = mdates.AutoDateLocator(minticks = 7) 
formatter = mdates.ConciseDateFormatter(locator) 
plt.gca().xaxis.set_major_locator(locator) 
plt.gca().xaxis.set_major_formatter(formatter)

The imports needed for the code above are:

import pandas as pd
import numpy as np
import statsmodels.api as sm
%matplotlib notebook
# the line above is a Jupyter-only magic; remove it when running as a plain Python script
import matplotlib.pyplot as plt
import matplotlib.dates as mdates