Tags: python, pandas, numpy, correlation, autocorrelation

How can I get the same result as pandas Series.autocorr() using only NumPy?


I need to replace all pandas functions with NumPy functions, but the pandas documentation does not explain clearly how Series.autocorr() is implemented.

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'A': np.random.random(20)})
# raw=False passes each window as a Series, which is what .autocorr() needs
x = df.rolling(5).apply(lambda x: x.autocorr(), raw=False).dropna()
y = []
for i in range(15):
  y.append(np.corrcoef(df['A'][i:i+5], df['A'][i+1:i+6])[0, 1])
  # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),df['A'][(1+i):(6+i)]-df['A'][(1+i):(6+i)].mean(),'valid')[0]
  # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),np.flip(df['A'][(1+i):(6+i)])-df['A'][(1+i):(6+i)].mean(),'valid')[0]

The Series.autocorr() result is quite different from that of np.corrcoef() (I tried np.correlate() as well). Is there any way to get the same result as Series.autocorr() using only NumPy functions?

----------------- Example result added ----------------

df['A'] = [0.5314742325906894, 0.7424912257400176, 0.2895649008872213, 0.16967710120380175, 0.5157732179121193, 0.8733423106397956, 0.585705172096987, 0.1387299202733231, 0.18540514459343538, 0.13913104211564564, 0.736937228263526, 0.20944078980434988, 0.2826810751427198, 0.15055686873748197, 0.4159491505728884, 0.07600226975854041, 0.15279939462562298, 0.1405723553409276, 0.8372449734938123, 0.3314986851097367]

x = [0.010637545587524432, 0.03594106077726333, 0.40104877005219836, -0.009106549297130558, 0.4008385963492408, 0.7794761931857483, -0.4182779136016351, -0.2962696925038811, -0.4083361773384266, -0.5244693987698964, -0.5063605533618415, -0.9496936641021706, -0.5303040575891907, -0.42881675192105184, -0.3371366910961831, -0.036231529863559424]

y = [0.11823200733266746, 0.16166841984627847, 0.2033980627120384, 0.2861039403548347, 0.5239653859040245, 0.1602079943122044, -0.3920837265006942, -0.28176746883177917, -0.3604612671108854, -0.5347077109231272, -0.4702461092101919, -0.5287673078857449, -0.4501452367448014, -0.3538574959825232, -0.10013342594129321]

Solution

  • According to the documentation of pandas.Series.autocorr, the default lag is 1, so the series is correlated with a copy of itself shifted by one element.

    For example:

    a = np.array([0.25, 0.5, 0.2, -0.05])
    s = pd.Series(a)
    s.autocorr()
    

    gives:

    0.1035526330902407
    

    With np.corrcoef you need to slice the array into two arrays offset from each other by the lag:

    np.corrcoef(a[:-1], a[1:])[0, 1]
    

    which gives the same result:

    0.1035526330902407
    

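    In your loop the two slices together span six elements (i to i+5), while each rolling window holds only five, and range(15) skips the last of the 16 windows. Both slices have to stay inside the window, so the code should look like this: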

    W = 5 # Window size
    nrows = len(df) - W + 1 # number of windows left after rolling
    lag=1
    y = []
    for i in range(nrows):
        y.append(np.corrcoef(df['A'][i:i+W-lag],df['A'][i+lag:i+W])[0,1])
    

    This gives the same values as x.
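
If you want to drop the Python loop as well, the same calculation can be sketched in vectorized form. This is only a sketch under a few assumptions: `rolling_autocorr` is a hypothetical helper name (not part of NumPy or pandas), the input contains no NaN values, `lag` is at least 1 and smaller than the window, and NumPy is version 1.20 or newer so that `sliding_window_view` is available. It computes the Pearson correlation between each window's first `window - lag` elements and its last `window - lag` elements, which is what the loop above does with np.corrcoef.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_autocorr(values, window, lag=1):
    """Lag-`lag` Pearson autocorrelation of every full rolling window.

    Hypothetical helper; assumes no NaNs and 1 <= lag < window.
    """
    values = np.asarray(values, dtype=float)
    # One row per full window, shape (len(values) - window + 1, window).
    windows = sliding_window_view(values, window)
    head = windows[:, :-lag]  # each window without its last `lag` elements
    tail = windows[:, lag:]   # each window without its first `lag` elements
    # Center each slice on its own mean, as np.corrcoef does.
    head_c = head - head.mean(axis=1, keepdims=True)
    tail_c = tail - tail.mean(axis=1, keepdims=True)
    cov = (head_c * tail_c).sum(axis=1)
    norm = np.sqrt((head_c ** 2).sum(axis=1) * (tail_c ** 2).sum(axis=1))
    return cov / norm
```

Called as `rolling_autocorr(df['A'].to_numpy(), 5)`, this should return one value per full window (16 for the 20-element example) and agree with the loop above up to floating-point rounding. A window whose slice is constant has zero variance and will produce nan, just as np.corrcoef does.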