I am hoping to do an event study analysis, but I cannot seem to properly build a simple predictive model with time as the independent variable. I've been using this as a guide.
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
#sample data
units = [0.916301354, 0.483947819, 0.551258976, 0.147971439, 0.617461504, 0.957460424, 0.905076453, 0.274261518, 0.861609383, 0.285914819, 0.989686616, 0.86614591, 0.074250832, 0.209507105, 0.082518752, 0.215795111, 0.953852132, 0.768329343, 0.380686392, 0.623940323, 0.155944248, 0.495745862, 0.0845513, 0.519966471, 0.706618333, 0.872300766, 0.70769554, 0.760616731, 0.213847926, 0.703866155, 0.802862491, 0.52468101, 0.352283626, 0.128962646, 0.684358794, 0.360520106, 0.889978575, 0.035806225, 0.15459103, 0.227742501, 0.06248614, 0.903500165, 0.13851151, 0.664684486, 0.011042697, 0.86353796, 0.971852899, 0.487774978, 0.547767217, 0.153629408, 0.076994094, 0.230693561, 0.961345948]
begin_date = '2022-8-01'
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})
# Create estimation data set
est_data = df['2022-08-01':'2022-08-30']
# And observation data
obs_data = df['2022-09-01':'2022-09-14']
# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()
# Get AR
# Using mean of estimation return
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return
# Then using model fit with estimation data
obs_data['risk_pred'] = m.predict()
obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']
# Graph the results
sns.lineplot(x = obs_data['date'],y = obs_data['AR_risk'])
plt.show()
As is, it won't recognise the date as a variable (image attached)
I've tried leaving the index as a counter and making the date a separate variable, but then the "predict" step does not know how to predict on dates that it has not seen before.
There are quite a lot of bugs in your code. I'll explain them one by one below (see the comments between ''' '''):
'''
small note, here you defined the variable as units and below you want to use a column called "variable".
Not a big problem, most probably you were reading the data from a file anyway, just something to keep in mind
'''
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})
'''
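The following two lines do not work as written.
First, the dataframe is not indexed by a datetime, so the date strings cannot be matched against the index.
Second, label-based row selection should go through .loc (which accepts date strings once the index is a DatetimeIndex); .iloc only takes integer positions.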
'''
# Create estimation data set
est_data = df['2022-08-01':'2022-08-30']
# And observation data
obs_data = df['2022-09-01':'2022-09-14']
'''
Here you are fitting according to est_data.
using the m.predict() function will give you the fitted points of est_data.
This will be important later
'''
# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()
# Get AR
# Using mean of estimation return
'''
You don't need np.mean for this; est_data['variable'].mean() does the same thing.
You most probably do not need a separate variable for the mean either.
You can subtract directly using obs_data['variable'] - est_data['variable'].mean()
'''
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return
'''
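This does not work here.
m.predict() without arguments returns the fitted values for est_data, so it outputs one value per row of est_data.
Assigning that to obs_data only runs if both have the same number of rows, and even then the values are the fit for the estimation dates, not predictions for the observation dates.
Pass the data you want predictions for as an argument to m.predict() instead.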
'''
obs_data['risk_pred'] = m.predict()
obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']
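To make the last comment concrete, here is a minimal sketch of the difference between calling predict() with and without new data. The toy data frame, the numeric counter column t, and the 15/5 split are assumptions made for illustration only; they are not the data from the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: a numeric time counter t and a noisy linear response
rng = np.random.default_rng(0)
toy = pd.DataFrame({"t": np.arange(20)})
toy["variable"] = 0.5 * toy["t"] + rng.normal(size=len(toy))

est = toy.iloc[:15]          # estimation window (15 rows)
obs = toy.iloc[15:].copy()   # observation window (5 rows)

m = smf.ols("variable ~ t", data=est).fit()

in_sample = m.predict()          # fitted values for the estimation rows
out_of_sample = m.predict(obs)   # predictions for the observation rows

print(len(in_sample), len(out_of_sample))

# Abnormal values relative to the model, computed on the observation window
obs["risk_pred"] = out_of_sample
obs["AR_risk"] = obs["variable"] - obs["risk_pred"]
```

The two results have different lengths, which is why the no-argument call cannot be assigned to obs_data in the general case.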
I am currently working on fixing the bugs and will give you a working example soon. In the meantime, can you please answer the following question:
est_data? If so, how are you going to combine this with obs_data?
Edit 1: how to separate the data
The following code references the dates in the data frame:
est_data_Start = pd.to_datetime('2022-08-01')
est_data_End = pd.to_datetime('2022-08-30')
obs_data_Start = pd.to_datetime('2022-09-01')
est_data = df[df["date"].between(est_data_Start,est_data_End)]
obs_data = df[df["date"] >= obs_data_Start].copy()  # >= keeps 2022-09-01 itself; .copy() allows adding columns later
The results for est_data are:
date variable
0 2022-08-01 0.916301
1 2022-08-02 0.483948
2 2022-08-03 0.551259
3 2022-08-04 0.147971
4 2022-08-05 0.617462
5 2022-08-06 0.957460
6 2022-08-07 0.905076
7 2022-08-08 0.274262
8 2022-08-09 0.861609
9 2022-08-10 0.285915
10 2022-08-11 0.989687
11 2022-08-12 0.866146
12 2022-08-13 0.074251
13 2022-08-14 0.209507
14 2022-08-15 0.082519
15 2022-08-16 0.215795
16 2022-08-17 0.953852
17 2022-08-18 0.768329
18 2022-08-19 0.380686
19 2022-08-20 0.623940
20 2022-08-21 0.155944
21 2022-08-22 0.495746
22 2022-08-23 0.084551
23 2022-08-24 0.519966
24 2022-08-25 0.706618
25 2022-08-26 0.872301
26 2022-08-27 0.707696
27 2022-08-28 0.760617
28 2022-08-29 0.213848
29 2022-08-30 0.703866
And the rest goes to obs_data.
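As an alternative to the boolean mask above, the string slicing from the question does work once the dates are the index and the selection goes through .loc. This is only a sketch: the data frame below is a hypothetical stand-in with 53 daily rows, and the name by_date is made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: 53 daily values
df = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": range(53),
})

# With the dates as the index, string labels can be used for slicing
by_date = df.set_index("date")

# .loc slices on a DatetimeIndex include both end points
est_data = by_date.loc["2022-08-01":"2022-08-30"].copy()
obs_data = by_date.loc["2022-09-01":"2022-09-14"].copy()

print(len(est_data), len(obs_data))
```

Note that with this approach the date is no longer a column, so anything that refers to df["date"] later would have to use the index instead (or call reset_index() first).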
Edit 2: fitting and predicting
The following code uses OLS to fit a model to est_data. Then, the model is used to predict values for the data in obs_data:
XTrain = est_data.index # get the training predictor
XTrain = sm.add_constant(XTrain) # add constant term to account for any intercept
m = sm.OLS(est_data["variable"], XTrain).fit() # fit according to training
XTest = obs_data.index # get the testing predictors
XTest = sm.add_constant(XTest) # and add a constant term
obs_data["risk_pred"] = m.predict(XTest) # predict based on the new data
# the following two calculations I just copied from you...
obs_data["AR_mean"] = obs_data["variable"] - est_data["variable"].mean()
obs_data["AR_risk"] = obs_data["variable"] - obs_data["risk_pred"]
The following code plots the results:
plt.figure()
plt.plot(est_data["date"], est_data["variable"], "-o", label = "Estimated")
plt.plot(obs_data["date"], obs_data["variable"], "-o", label = "Observed")
plt.plot(est_data["date"], m.predict(XTrain), label = "Train fit")
plt.plot(obs_data["date"], m.predict(XTest), label = "Test fit")
plt.legend(ncols =4, bbox_to_anchor=[0.5, 1.1, 0.5, 0])
plt.grid()
locator = mdates.AutoDateLocator(minticks = 7)
formatter = mdates.ConciseDateFormatter(locator)
plt.gca().xaxis.set_major_locator(locator)
plt.gca().xaxis.set_major_formatter(formatter)
The results and the imports are in the following section:
Imports:
import pandas as pd
import numpy as np
import statsmodels.api as sm
%matplotlib notebook
# (the line above is a Jupyter magic; remove it when running as a plain script)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates