
Trouble doing time series analysis in Python


I am hoping to do an event study analysis, but I cannot seem to build a simple predictive model with time as the independent variable. I've been using this as a guide.

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

#sample data
units = [0.916301354, 0.483947819, 0.551258976, 0.147971439, 0.617461504, 0.957460424, 0.905076453, 0.274261518, 0.861609383, 0.285914819, 0.989686616, 0.86614591, 0.074250832, 0.209507105, 0.082518752, 0.215795111, 0.953852132, 0.768329343, 0.380686392, 0.623940323, 0.155944248, 0.495745862, 0.0845513, 0.519966471, 0.706618333, 0.872300766, 0.70769554, 0.760616731, 0.213847926, 0.703866155, 0.802862491, 0.52468101, 0.352283626, 0.128962646, 0.684358794, 0.360520106, 0.889978575, 0.035806225, 0.15459103, 0.227742501, 0.06248614, 0.903500165, 0.13851151, 0.664684486, 0.011042697, 0.86353796, 0.971852899, 0.487774978, 0.547767217, 0.153629408, 0.076994094, 0.230693561, 0.961345948]
begin_date = '2022-8-01'
df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})


# Create estimation data set
est_data = df['2022-08-01':'2022-08-30']

# And observation data
obs_data = df['2022-09-01':'2022-09-14']

# Estimate a model predicting stock price with market return
m = smf.ols('variable ~ date', data = est_data).fit()

# Get AR
# Using mean of estimation return
var_return = np.mean(est_data['variable'])
obs_data['AR_mean'] = obs_data['variable'] - var_return

# Then using model fit with estimation data
obs_data['risk_pred'] = m.predict()

obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']

# Graph the results
sns.lineplot(x = obs_data['date'],y = obs_data['AR_risk'])
plt.show()

As is, it won't recognise the date as a variable and raises an error (the original post attached a screenshot of the error message).

I've tried leaving the index as a counter and making the date a separate variable, but then when it gets to the "predict" portion, it doesn't understand how to predict on dates that it has not seen before.


Solution

  • There are quite a lot of bugs in your code. I'll explain them one by one below (see the comments between ''' '''):

    '''
    Small note: here you named the column "units", but below you use a column called "variable".
    Not a big problem, since you were most probably reading the data from a file anyway; just something to keep in mind.
    '''
    df = pd.DataFrame({'date':pd.date_range(begin_date, periods=len(units)),'units':units})
    '''
    The following two lines do not work like that.
    First, the dataframe is not indexed by a datetime, so date strings cannot be used to slice its rows.
    Second, to select rows by label you should use .loc (on a datetime index); .iloc selects by integer position only.
    '''
    # Create estimation data set
    est_data = df['2022-08-01':'2022-08-30'] 
    # And observation data
    obs_data = df['2022-09-01':'2022-09-14']
    '''
    Here you are fitting according to est_data.
    using the m.predict() function will give you the fitted points of est_data.
    This will be important later
    '''
    # Estimate a model predicting stock price with market return
    m = smf.ols('variable ~ date', data = est_data).fit()
    # Get AR
    # Using mean of estimation return
    '''
    You don't need np.mean for this, just use est_data['variable'].mean().
    Also, you probably don't need to store the mean in a separate variable.
    You can subtract directly using obs_data['variable'] - est_data['variable'].mean()
    '''
    var_return = np.mean(est_data['variable'])
    obs_data['AR_mean'] = obs_data['variable'] - var_return
    '''
    This will not always work, and in this case it does not.
    m.predict() without arguments returns the fitted values for est_data, so it outputs as many points as est_data has.
    The assignment only goes through if obs_data has the same number of points as est_data, and even then the values are fits for est_data, not predictions for obs_data.
    '''
    obs_data['risk_pred'] = m.predict()
    obs_data['AR_risk'] = obs_data['variable'] - obs_data['risk_pred']
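
One way to make date-string slices like the ones in the question work is to index the frame by date and slice with .loc. This is only a minimal sketch: it assumes the column is named "variable" and uses synthetic random values as a stand-in for the question's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): 53 daily values starting 2022-08-01.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": rng.random(53),
})

# Index by date first; .loc then slices by label, inclusive at both ends.
indexed = df.set_index("date")
est_data = indexed.loc["2022-08-01":"2022-08-30"].copy()
obs_data = indexed.loc["2022-09-01":"2022-09-14"].copy()

print(len(est_data), len(obs_data))  # 30 14
```

The .copy() calls are there so that columns can be added to the slices later without pandas warning about writing to a view of df.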
    

    I am currently working on fixing the bugs and will give you a working example soon. In the meantime, can you please answer the following question:

    • Do you really want to fit the model on est_data? If so, how are you going to combine this with obs_data?

    Edit 1: how to separate the data

    The following code selects rows by comparing the date column against datetime values:

    est_data_Start = pd.to_datetime('2022-08-01')
    est_data_End = pd.to_datetime('2022-08-30')
    obs_data_Start = pd.to_datetime('2022-09-01')
    est_data = df[df["date"].between(est_data_Start,est_data_End)]
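    # >= keeps 2022-09-01 itself; .copy() allows adding columns later without a SettingWithCopyWarning
    obs_data = df[df["date"]>=obs_data_Start].copy()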
    

    The results for est_data are:

        date    variable
    0   2022-08-01  0.916301
    1   2022-08-02  0.483948
    2   2022-08-03  0.551259
    3   2022-08-04  0.147971
    4   2022-08-05  0.617462
    5   2022-08-06  0.957460
    6   2022-08-07  0.905076
    7   2022-08-08  0.274262
    8   2022-08-09  0.861609
    9   2022-08-10  0.285915
    10  2022-08-11  0.989687
    11  2022-08-12  0.866146
    12  2022-08-13  0.074251
    13  2022-08-14  0.209507
    14  2022-08-15  0.082519
    15  2022-08-16  0.215795
    16  2022-08-17  0.953852
    17  2022-08-18  0.768329
    18  2022-08-19  0.380686
    19  2022-08-20  0.623940
    20  2022-08-21  0.155944
    21  2022-08-22  0.495746
    22  2022-08-23  0.084551
    23  2022-08-24  0.519966
    24  2022-08-25  0.706618
    25  2022-08-26  0.872301
    26  2022-08-27  0.707696
    27  2022-08-28  0.760617
    28  2022-08-29  0.213848
    29  2022-08-30  0.703866
    

    The later rows, selected by the obs_data filter, go to obs_data.
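
The boundary behaviour of these date comparisons is easy to get wrong, so here is a small sketch of it on a bare date series (the variable names are only illustrative): Series.between includes both endpoints by default, while a strict > drops the start date itself.

```python
import pandas as pd

dates = pd.Series(pd.date_range("2022-08-01", periods=53))
start = pd.Timestamp("2022-09-01")

# Series.between includes both endpoints by default.
in_window = dates[dates.between(pd.Timestamp("2022-08-01"),
                                pd.Timestamp("2022-08-30"))]

# A strict ">" drops the start date itself; ">=" keeps it.
after_strict = dates[dates > start]
after_inclusive = dates[dates >= start]

print(in_window.iloc[-1].date())       # 2022-08-30
print(after_strict.iloc[0].date())     # 2022-09-02
print(after_inclusive.iloc[0].date())  # 2022-09-01
```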

    Edit 2: fitting and predicting

    The following code uses OLS to fit a model to est_data, with the integer row index serving as the time predictor. The model is then used to predict values for the rows in obs_data:

    XTrain = est_data.index # get the training predictor
    XTrain = sm.add_constant(XTrain) # add constant term to account for any intercept
    m = sm.OLS(est_data["variable"], XTrain).fit() # fit according to training
    XTest = obs_data.index # get the testing predictors
    XTest = sm.add_constant(XTest) # and add a constant term
    obs_data["risk_pred"] = m.predict(XTest) # predict based on the new data
    # the following two calculations I just copied from you...
    obs_data["AR_mean"] = obs_data["variable"] - est_data["variable"].mean()
    obs_data["AR_risk"] = obs_data["variable"] - obs_data["risk_pred"]
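
If you would rather stay with the formula interface from the question, a possible alternative is to turn the date into a numeric day counter and pass the new rows to predict. This is a sketch under assumptions, not a definitive event-study implementation: the column name "t" is hypothetical, the values are synthetic, and the observation window is taken as 2022-09-01 to 2022-09-14 as in the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): 53 daily values starting 2022-08-01.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2022-08-01", periods=53),
    "variable": rng.random(53),
})

# Numeric time predictor (hypothetical column "t"): days since the first date.
df["t"] = (df["date"] - df["date"].min()).dt.days

est_data = df[df["date"].between(pd.Timestamp("2022-08-01"),
                                 pd.Timestamp("2022-08-30"))].copy()
obs_data = df[df["date"].between(pd.Timestamp("2022-09-01"),
                                 pd.Timestamp("2022-09-14"))].copy()

# Fit on the estimation window only.
m = smf.ols("variable ~ t", data=est_data).fit()

# Without arguments, predict() returns the fitted values for est_data.
fitted = m.predict()

# Passing a frame that contains "t" gives predictions for unseen dates.
obs_data["risk_pred"] = m.predict(obs_data)
obs_data["AR_risk"] = obs_data["variable"] - obs_data["risk_pred"]

print(len(fitted), len(obs_data))  # 30 14
```

Because the model only sees the number "t", it can extrapolate to dates outside the estimation window, which is what the bare predict() call could not do.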
    

    The following code plots the results:

    plt.figure()
    plt.plot(est_data["date"], est_data["variable"], "-o", label = "Estimated")
    plt.plot(obs_data["date"], obs_data["variable"], "-o", label = "Observed")
    plt.plot(est_data["date"], m.predict(XTrain), label = "Train fit")
    plt.plot(obs_data["date"], m.predict(XTest), label = "Test fit")
    plt.legend(ncols =4, bbox_to_anchor=[0.5, 1.1, 0.5, 0])
    plt.grid()
    locator = mdates.AutoDateLocator(minticks = 7) 
    formatter = mdates.ConciseDateFormatter(locator) 
    plt.gca().xaxis.set_major_locator(locator) 
    plt.gca().xaxis.set_major_formatter(formatter) 
    

    The results and the imports are in the following section:

    [Image in the original answer: plot of the fit]

    Imports:

    import pandas as pd
    import numpy as np
    import statsmodels.api as sm
    # Jupyter magic; leave this line out when running as a plain script
    %matplotlib notebook
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates