python-2.7 regression finance statsmodels

Regression analysis using statsmodels


Please help me get output from this code. Why is the output of this code NaN? What am I doing wrong?

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import math
import datetime as dt
#importing Data
es_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/hbrbcpe.txt'
vs_url = 'https://www.stoxx.com/document/Indices/Current/HistoricalData/h_vstoxx.txt'
#creating DataFrame
cols=['SX5P','SX5E','SXXP','SXXE','SXXF','SXXA','DK5f','DKXF']
es=pd.read_csv(es_url,index_col=0,parse_dates=True,sep=';',dayfirst=True,header=None,skiprows=4,names=cols)
vs=pd.read_csv(vs_url,index_col=0,header=2,parse_dates=True,sep=',',dayfirst=True)
data=pd.DataFrame({'EUROSTOXX' : es['SX5E'][es.index > dt.datetime(1999,1,1)]},dtype=float)
data=data.join(pd.DataFrame({'VSTOXX' : vs['V2TX'][vs.index > dt.datetime(1999,1,1)]},dtype=float))
data=data.fillna(method='ffill')
rets=(((data/data.shift(1))-1)*100).round(2)
xdat = rets['EUROSTOXX']
ydat = rets['VSTOXX']
#regression analysis
model = smf.ols('ydat ~ xdat',data=rets).fit()
print model.summary()

Solution

  • The problem is that computing rets divides by zero, which produces inf values. In addition, shift introduces NaNs, so there are missing values that must be handled in some way before proceeding to the regression.

    Walk through this example using your data and see:

    df = data.loc['2016-03-20':'2016-04-01'].copy()
    

    df looks like:

                EUROSTOXX   VSTOXX
    2016-03-21    3048.77  35.6846
    2016-03-22    3051.23  35.6846
    2016-03-23    3042.42  35.6846
    2016-03-24    2986.73  35.6846
    2016-03-25       0.00  35.6846
    2016-03-28       0.00  35.6846
    2016-03-29    3004.87  35.6846
    2016-03-30    3044.10  35.6846
    2016-03-31    3004.93  35.6846
    2016-04-01    2953.28  35.6846
    

    Shifting by 1 and dividing:

    df = (((df/df.shift(1))-1)*100).round(2)
    

    Prints out:

                 EUROSTOXX  VSTOXX
    2016-03-21         NaN     NaN
    2016-03-22    0.080688     0.0
    2016-03-23   -0.288736     0.0
    2016-03-24   -1.830451     0.0
    2016-03-25 -100.000000     0.0
    2016-03-28         NaN     0.0
    2016-03-29         inf     0.0
    2016-03-30    1.305547     0.0
    2016-03-31   -1.286751     0.0
    2016-04-01   -1.718842     0.0
    

    Take-aways: shifting by 1 always leaves a NaN in the first row. Dividing 0.00 by 0.00 produces a NaN (the 2016-03-28 row), and dividing a nonzero value by 0.00 produces an inf (the 2016-03-29 row).
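
    The three cases can be reproduced in isolation. The following is only a minimal sketch on a made-up toy series (the name prices and the shortened list of values are assumptions for illustration, not the full downloaded data):

```python
import numpy as np
import pandas as pd

# Toy prices with two zero entries, mimicking the 2016-03-25 and
# 2016-03-28 rows in the excerpt above (hypothetical stand-in data).
prices = pd.Series([2986.73, 0.0, 0.0, 3004.87, 3044.10])

# Same percentage-return formula as in the question, without rounding.
rets = (prices / prices.shift(1) - 1) * 100

# Row 0: shift(1) has no previous value, so the result is NaN.
first_is_nan = np.isnan(rets.iloc[0])
# Row 1: 0 / 2986.73 - 1 is a return of -100 percent.
drop_to_zero = rets.iloc[1]
# Row 2: 0 / 0 is NaN.
zero_over_zero_is_nan = np.isnan(rets.iloc[2])
# Row 3: 3004.87 / 0 is inf.
nonzero_over_zero_is_inf = np.isinf(rets.iloc[3])

print(rets)
```

    Only the last kind of row (a nonzero price following a zero price) yields an inf; the row before it is a NaN.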

    One possible solution to handle missing values:

    ...
    xdat = rets['EUROSTOXX']
    ydat = rets['VSTOXX']
    
    # handle missing values
    # rows where the EUROSTOXX return is NaN or +/-inf
    bad = ~np.isfinite(xdat)
    # turn +/-inf into NaN, then fill every bad row with the mean of the valid returns
    xdat = xdat.replace([-np.inf, np.inf], np.nan)
    xdat[bad] = xdat.mean()
    # fill missing VSTOXX returns on those same rows with 0.0
    ydat[bad] = ydat[bad].fillna(0.0)
    
    #regression analysis
    model = smf.ols('ydat ~ xdat',data=rets, missing='raise').fit()
    print(model.summary())
    

    Notice I added the missing='raise' parameter to ols so that it raises an error if any missing values remain, which makes it easier to see what's going on.

    End result prints out:

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                   ydat   R-squared:                       0.259
    Model:                            OLS   Adj. R-squared:                  0.259
    Method:                 Least Squares   F-statistic:                     1593.
    Date:                Wed, 03 Jan 2018   Prob (F-statistic):          5.76e-299
    Time:                        12:01:14   Log-Likelihood:                -13856.
    No. Observations:                4554   AIC:                         2.772e+04
    Df Residuals:                    4552   BIC:                         2.773e+04
    Df Model:                           1                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept      0.1608      0.075      2.139      0.033       0.013       0.308
    xdat          -1.4209      0.036    -39.912      0.000      -1.491      -1.351
    ==============================================================================
    Omnibus:                     4280.114   Durbin-Watson:                   2.074
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4021394.925
    Skew:                          -3.446   Prob(JB):                         0.00
    Kurtosis:                     148.415   Cond. No.                         2.11
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
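
    An alternative to patching the returns afterwards is to clean the prices before computing them. The sketch below assumes that a price of exactly 0.00 is a placeholder for a day without a quote rather than a real value; that is a guess based on the excerpt above, not something the data source confirms. The toy frame stands in for data, and the name clean is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; values copied from the excerpt above.
data = pd.DataFrame(
    {
        "EUROSTOXX": [2986.73, 0.0, 0.0, 3004.87, 3044.10],
        "VSTOXX": [35.6846] * 5,
    },
    index=pd.to_datetime(
        ["2016-03-24", "2016-03-25", "2016-03-28", "2016-03-29", "2016-03-30"]
    ),
)

# Assumption: a price of exactly 0 means "no quote", so mark it as missing
# and carry the last valid price forward.
clean = data.replace(0.0, np.nan).ffill()

# Same return formula as in the question.
rets = ((clean / clean.shift(1) - 1) * 100).round(2)

# Safety net: turn any remaining +/-inf into NaN, then drop incomplete rows
# (this also removes the first row, which shift(1) leaves as NaN).
rets = rets.replace([np.inf, -np.inf], np.nan).dropna()

print(rets)
```

    With the zeros removed up front, no division by zero occurs, and the resulting rets could be passed to the regression as before, for example with the column names directly in the formula ('VSTOXX ~ EUROSTOXX'). The coefficients would differ from the summary above, since the bad rows are handled differently.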