Search code examples
pythonpandasstatsmodels

What's the difference between pandas ACF and statsmodel ACF?


I'm calculating the Autocorrelation Function for a stock's returns. To do so I tested two functions, the autocorr function built into Pandas, and the acf function supplied by statsmodels.tsa. This is done in the following MWE:

import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
import datetime
from dateutil.relativedelta import relativedelta
from statsmodels.tsa.stattools import acf, pacf

ticker = 'AAPL'
time_ago = datetime.datetime.today().date() - relativedelta(months = 6)

ticker_data = data.get_data_yahoo(ticker, time_ago)['Adj Close'].pct_change().dropna()
ticker_data_len = len(ticker_data)

ticker_data_acf_1 =  acf(ticker_data)[1:32]
ticker_data_acf_2 = [ticker_data.autocorr(i) for i in range(1,32)]

test_df = pd.DataFrame([ticker_data_acf_1, ticker_data_acf_2]).T
test_df.columns = ['Pandas Autocorr', 'Statsmodels Autocorr']
test_df.index += 1
test_df.plot(kind='bar')

What I noticed was the values they predicted weren't identical:

enter image description here

What accounts for this difference, and which values should be used?


Solution

  • The difference between the Pandas and Statsmodels version lie in the mean subtraction and normalization / variance division:

    • autocorr does nothing more than passing subseries of the original series to np.corrcoef. Inside this method, the sample mean and sample variance of these subseries are used to determine the correlation coefficient
    • acf, in contrary, uses the overall series sample mean and sample variance to determine the correlation coefficient.

    The differences may get smaller for longer time series but are quite big for short ones.

    Compared to Matlab, the Pandas autocorr function probably corresponds to doing Matlabs xcorr (cross-corr) with the (lagged) series itself, instead of Matlab's autocorr, which calculates the sample autocorrelation (guessing from the docs; I cannot validate this because I have no access to Matlab).

    See this MWE for clarification:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import acf
    import matplotlib.pyplot as plt
    plt.style.use("seaborn-colorblind")
    
    def autocorr_by_hand(x, lag):
        # Slice the relevant subseries based on the lag
        y1 = x[:(len(x)-lag)]
        y2 = x[lag:]
        # Subtract the subseries means
        sum_product = np.sum((y1-np.mean(y1))*(y2-np.mean(y2)))
        # Normalize with the subseries stds
        return sum_product / ((len(x) - lag) * np.std(y1) * np.std(y2))
    
    def acf_by_hand(x, lag):
        # Slice the relevant subseries based on the lag
        y1 = x[:(len(x)-lag)]
        y2 = x[lag:]
        # Subtract the mean of the whole series x to calculate Cov
        sum_product = np.sum((y1-np.mean(x))*(y2-np.mean(x)))
        # Normalize with var of whole series
        return sum_product / ((len(x) - lag) * np.var(x))
    
    x = np.linspace(0,100,101)
    
    results = {}
    nlags=10
    results["acf_by_hand"] = [acf_by_hand(x, lag) for lag in range(nlags)]
    results["autocorr_by_hand"] = [autocorr_by_hand(x, lag) for lag in range(nlags)]
    results["autocorr"] = [pd.Series(x).autocorr(lag) for lag in range(nlags)]
    results["acf"] = acf(x, unbiased=True, nlags=nlags-1)
    
    pd.DataFrame(results).plot(kind="bar", figsize=(10,5), grid=True)
    plt.xlabel("lag")
    plt.ylim([-1.2, 1.2])
    plt.ylabel("value")
    plt.show()
    

    enter image description here

    Statsmodels uses np.correlate to optimize this, but this is basically how it works.