Tags: python, pandas, statsmodels, autocorrelation

Why am I getting different autocorrelation results from different libraries?


  1. Why am I getting different autocorrelation results from different libraries?
  2. Which one is correct?
import numpy as np
from scipy import signal

# Given data
data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])

# Compute the autocorrelation using scipy's correlate function
autocorrelations = signal.correlate(data, data, mode='full')

# The middle of the autocorrelations array is at index len(data)-1
mid_index = len(data) - 1

# Show autocorrelation values for lag=1,2,3,4,...
print(autocorrelations[mid_index + 1:])

Output:

[21.2425 17.285  13.4525  9.8075  6.4125  3.33  ]

import pandas as pd

# Given data
data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]

# Convert data to pandas Series
series = pd.Series(data)

# Compute and print autocorrelation for lags 0 to length of series - 1
for lag in range(0, len(data)):
    print(series.autocorr(lag=lag))

Output:

1.0
0.9374115462038415
0.9287843240596312
0.9260849979667674
0.9407970411588671
0.9999999999999999

from statsmodels.tsa.stattools import acf

# Your data
data = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33]

# Calculate the autocorrelation using the acf function
autocorrelation = acf(data, nlags=len(data)-1, fft=True)

# Display the autocorrelation coefficients for lags 0,1,2,3,...
print(autocorrelation)

Output:

[ 1.       0.39072553  0.13718689 -0.08148897 -0.24787067 -0.3445268 -0.35402598]

Solution

  • "They are likely each correct according to their chosen definition of autocorrelation. Edge discontinuity effects at start and end of the array dominate short runs of data." - @Martin Brown in the comments.

    Scipy's correlate function: Documentation

    Scipy's signal.correlate function computes the cross-correlation of two sequences. Because the same data is passed for both sequences, the result is the autocorrelation in the signal-processing sense: the raw sum of products at each shift, with no mean subtraction and no normalization. With mode='full' the output covers negative, zero, and positive lags, and the question's code extracts the positive lags.
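
    As a rough check of what signal.correlate returns here, the positive-lag values can be rebuilt with plain NumPy as sums of products of the data with a shifted copy of itself. The helper name raw_autocorr is made up for this sketch and is not part of SciPy.

```python
import numpy as np

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])

def raw_autocorr(x, lag):
    """Hypothetical helper: unnormalized sum of products at a positive lag."""
    # Pair x[t] with x[t + lag]; no mean is subtracted and nothing is divided out.
    return float(np.dot(x[:-lag], x[lag:]))

# Lags 1 .. len(data) - 1, the same range the question slices out.
raw = [raw_autocorr(data, k) for k in range(1, len(data))]
print(np.round(raw, 4))
```

    These values should agree with the SciPy output in the question. They shrink with the lag mainly because fewer terms are summed, not because the data decorrelates.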

    Pandas Series autocorr method: Documentation

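    Pandas' autocorr method computes the Pearson correlation coefficient between the Series and a copy of itself shifted by lag. Only the overlapping pairs are used, and the means and standard deviations are recomputed from those pairs at every lag, which is why the values stay close to 1 for steadily increasing data. Each call returns a single value; the loop in the question prints lag 0 (always 1.0) followed by the positive lags.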
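
    The same numbers can be approximated without pandas by taking the Pearson correlation of the two overlapping slices, which is a minimal sketch of the definition described above. The helper name pearson_autocorr is an assumption for illustration only.

```python
import numpy as np

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])

def pearson_autocorr(x, lag):
    """Hypothetical helper: Pearson r between x[:-lag] and x[lag:]."""
    if lag == 0:
        return 1.0  # a series is perfectly correlated with itself
    # np.corrcoef recomputes means and variances from these two slices only.
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

# Stop at lag 5: at lag 6 a single pair remains and a correlation is undefined.
pearson = [pearson_autocorr(data, k) for k in range(0, len(data) - 1)]
print(pearson)
```

    At lag 5 only two pairs are left, and any two points with distinct values lie exactly on a line, which explains the value of essentially 1.0 at the end of the pandas output.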

    Statsmodels acf function: Documentation

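    Statsmodels' acf function subtracts the mean of the whole series once, computes the autocovariance at each lag, and divides by the lag-0 autocovariance. By default (adjusted=False, called unbiased in older releases) the sum at every lag is divided by the number of observations n, which is the biased estimator; adjusted=True divides by n - k at lag k instead. The fft=True argument only selects how the sums are computed and is unrelated to the biased/unbiased choice. The output includes the autocorrelation at lag 0 and positive lags.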
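
    A NumPy-only sketch of this estimator is shown below, assuming the global-mean definition described above. The function acf_global_mean is hypothetical; its adjusted argument merely mirrors the name of the statsmodels parameter.

```python
import numpy as np

data = np.array([1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 3.33])

def acf_global_mean(x, adjusted=False):
    """Hypothetical helper: ACF using one global mean for every lag."""
    n = len(x)
    d = x - x.mean()  # subtract the mean of the full series once
    # Sum of products of deviations at each lag k = 0 .. n - 1.
    sums = np.array([np.dot(d[:n - k], d[k:]) for k in range(n)])
    # Biased estimator divides by n; the adjusted one divides by n - k.
    denom = (n - np.arange(n)) if adjusted else n
    acov = sums / denom
    return acov / acov[0]  # normalize so that lag 0 equals 1

biased = acf_global_mean(data)
print(np.round(biased, 8))
```

    With the default adjusted=False, the result should match the statsmodels output in the question. The negative values at larger lags come from the global mean: early points sit below it and late points above it, so their products are negative.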

    1. Why are you seeing different results? Each library implements a different definition of autocorrelation (see the docs for details).
    2. Which one is correct depends on what you mean by "correct" in the context of your analysis.