Tags: python, statsmodels, causality, vector-auto-regression

How to test for Granger Causality in Python (Toda & Yamamoto style)?


I am trying to implement the process for Granger causality testing outlined in this blog post by Dave Giles, which I understand is a well-known post about performing a Granger causality test on non-stationary data, following the Toda-Yamamoto method. He works in EViews, but I would really like to do the same steps in Python.

In Giles' example, he uses two time-series variables. First, he determines the order of integration of both variables using both the ADF test and KPSS test, and then he sets up a VAR model on the levels of the data, using FPE, AIC, SC, and HQ information criteria to determine the maximum lag length included in the VAR model. These steps cover steps 1-4 in the blog post.

All of the above I can reproduce in Python using the statsmodels library's .tsa.vector_ar.var_model.VAR() and .tsa.vector_ar.var_model.VAR.fit(). I am not sure how to implement his step 5, where he "appl[ies] the LM test for serial independence against the alternative of AR(k)/MA(k), for k = 1, ...., 12,..." (LM here stands for Lagrange Multiplier, i.e. the Breusch-Godfrey test for residual serial correlation) and also checks whether the model is dynamically stable. I am also not certain how to perform step 6, Johansen's Trace and Maximum Eigenvalue tests, which check for cointegration between the two time-series variables. I see there is a method for the Johansen cointegration test for a VECM, but how would I incorporate that for a VAR?

I also can't seem to figure out how to include an exogenous variable in the VAR model corresponding to the maximum lag length plus the maximum order of integration, as he does at the end of the blog post. He does this to ensure the Wald test maintains an asymptotic chi-square null distribution.

I know that both statsmodels.tsa.stattools.grangercausalitytests() and statsmodels.tsa.vector_ar.var_model.VARResults.test_causality() exist, but I would like to be able to correctly perform these tests on two time-series variables, which necessitates following the procedure Giles outlines (or a procedure similar to it).

Any advice would be appreciated! Please let me know if I need to include some example code.


Solution

  • Step 5:

    You want to check the ACF plot of the residuals of the VAR chosen in step 4.
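
    If you still need to pick best_lag, VAR.select_order() reports the lag length preferred by each information criterion (FPE, AIC, BIC, HQIC) in one call. A minimal sketch, assuming data is a DataFrame holding the two series in levels:

    from statsmodels.tsa.api import VAR
    
    # compare information criteria up to a chosen maximum lag;
    # selected_orders maps each criterion to its preferred lag length
    lag_selection = VAR(data).select_order(maxlags=12)
    print(lag_selection.summary())
    best_lag = lag_selection.selected_orders['aic']

    With best_lag in hand, fit the VAR on the levels: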

    from statsmodels.tsa.api import VAR
    
    # fit the VAR in levels at the chosen lag length
    best_model = VAR(data)
    best_model_fitted = best_model.fit(best_lag)
    

    You then want to get the residuals of this fitted model and plot their ACF, one series at a time.

    from statsmodels.graphics.tsaplots import plot_acf
    
    # VARResults exposes the residuals directly via .resid
    residuals = best_model_fitted.resid
    for col in residuals.columns:
        plot_acf(residuals[col], lags=20, title=f'Residual ACF: {col}')
    

    If there are no significant autocorrelations, then your VAR is well-specified (i.e. the best_lag you selected was adequate).
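
    For more formal checks, statsmodels offers test_whiteness() and is_stable() on the fitted VAR. Note that test_whiteness is a Portmanteau test rather than the Breusch-Godfrey LM test Giles runs in EViews, but it targets the same question (residual serial correlation); a sketch:

    # Portmanteau test for residual autocorrelation; nlags must be
    # larger than the VAR order (the blog post checks up to 12 lags)
    whiteness = best_model_fitted.test_whiteness(nlags=12)
    print(whiteness.summary())
    
    # dynamic stability: all roots of the companion matrix must lie
    # inside the unit circle
    print(best_model_fitted.is_stable(verbose=True))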

    Step 6:

    I didn't have to perform the cointegration test, as my time series were of different orders of integration. But statsmodels does provide the Johansen test as statsmodels.tsa.vector_ar.vecm.coint_johansen. Although it lives in the vecm module, you run it directly on the levels data, so there is no need to fit a VECM first; the trace and maximum-eigenvalue statistics both come out of the same call.
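
    A minimal sketch, assuming data and best_lag from step 5 (k_ar_diff is the number of lagged differences, which is one less than the lag order of the VAR in levels; det_order=0 includes a constant):

    from statsmodels.tsa.vector_ar.vecm import coint_johansen
    
    johansen = coint_johansen(data, det_order=0, k_ar_diff=best_lag - 1)
    
    # lr1 holds the trace statistics, lr2 the max-eigenvalue statistics;
    # cvt and cvm hold the matching 90% / 95% / 99% critical values
    print('trace stats:        ', johansen.lr1)
    print('trace crit (95%):   ', johansen.cvt[:, 1])
    print('max-eig stats:      ', johansen.lr2)
    print('max-eig crit (95%): ', johansen.cvm[:, 1])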

    Step 7:

    You're right that statsmodels.tsa.stattools.grangercausalitytests() does not give you the flexibility to perform the test in the way specified in the blog post. But fret not: the Granger causality test uses a Wald test under the hood. Since a VAR model is essentially a set of linear regressions of each variable on its own lags and the lags of the other variables, we can fit an ordinary linear regression and then call wald_test() on the resulting RegressionResults object.

    For example, assume that you have p = 4 and d = 2, so the Toda-Yamamoto augmented VAR includes p + d = 6 lags of each series. The equation for the first time series (which has terms for lagged versions of itself and lagged versions of the other time series) would look like $$\begin{aligned}y_{1,t}=c_1&+\phi_{11,1}y_{1,t-1}+\phi_{11,2}y_{1,t-2}+\phi_{11,3}y_{1,t-3}+\phi_{11,4}y_{1,t-4}+\phi_{11,5}y_{1,t-5}+\phi_{11,6}y_{1,t-6}\\&+\phi_{12,1}y_{2,t-1}+\phi_{12,2}y_{2,t-2}+\phi_{12,3}y_{2,t-3}+\phi_{12,4}y_{2,t-4}+\phi_{12,5}y_{2,t-5}+\phi_{12,6}y_{2,t-6}+\varepsilon_{1,t}\end{aligned}$$

    So we need to build a dataset that has columns representing lagged values of y1 and y2.

    # assume df has two columns: y1 and y2
    # create one lagged column per series per lag (p + d = 6 lags in total)
    lags = [1, 2, 3, 4, 5, 6]
    
    for col in ['y1', 'y2']:
        for lag in lags:
            df[f'{col}_lag{lag}'] = df[col].shift(lag)
    
    # shifting leaves NaNs in the first max(lags) rows; drop them
    # (keep both y1 and y2 so the test can later be run in the
    # reverse direction as well)
    df.dropna(inplace=True)
    
    Y = df['y1']
    X = df[['y1_lag1', 'y1_lag2', 'y1_lag3',
            'y1_lag4', 'y1_lag5', 'y1_lag6',
            'y2_lag1', 'y2_lag2', 'y2_lag3',
            'y2_lag4', 'y2_lag5', 'y2_lag6']]
    

    Fit the model. Here you can sanity-check that the parameters you get from the linear regression match the ones in the y1 equation of a VAR fitted with the same 6 lags, if you'd like.

    import statsmodels.api as sm
    
    # add an intercept and regress y1 on all 12 lagged columns
    X = sm.add_constant(X)
    model = sm.OLS(Y, X)
    results = model.fit()
    results.params
    

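    As a sketch of that sanity check (again assuming data is the original two-column levels DataFrame), refit the VAR with all 6 lags and compare; the coefficients should agree, though the VAR orders its rows const, L1.y1, L1.y2, L2.y1, ... rather than grouping all y1 lags first:

    # the y1 column of the 6-lag VAR's parameter table should match
    # the OLS estimates up to the ordering of the regressors
    var6 = VAR(data).fit(6)
    print(var6.params['y1'])
    print(results.params)
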
    Now we are ready to perform our Wald test in the way specified in the blog. With p = 4 and d = 2, we restrict only the coefficients on the first p = 4 lags of y2 to zero. The extra d = 2 lags (y2_lag5 and y2_lag6) are deliberately left unrestricted: they are in the model purely so that the Wald statistic keeps its asymptotic chi-square null distribution.

    # Perform Wald test
    hypothesis = 'y2_lag1 = y2_lag2 = y2_lag3 = y2_lag4 = 0'
    wald_test = results.wald_test(hypothesis)
    
    print(wald_test)
    

    A p value will be printed out, and if it is greater than 0.05 you do not have enough evidence to reject the null hypothesis, meaning that you cannot say that y2 "Granger-causes" y1.
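
    Giles runs the test in both directions, so repeat the regression with the roles swapped to ask whether y1 "Granger-causes" y2. A condensed sketch, reusing the lagged columns built above:

    # reverse direction: does y1 Granger-cause y2?
    Y2 = df['y2']
    X2 = sm.add_constant(df[[f'y2_lag{i}' for i in lags] +
                            [f'y1_lag{i}' for i in lags]])
    results2 = sm.OLS(Y2, X2).fit()
    
    # again restrict only the first p = 4 lags of the other series
    print(results2.wald_test('y1_lag1 = y1_lag2 = y1_lag3 = y1_lag4 = 0'))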