I need to replace all pandas functions with NumPy functions, but the pandas documentation does not explain clearly how pd.Series.autocorr()
is implemented.
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({'A': np.random.random(20)})
# raw=False passes each window as a Series, which is what .autocorr() needs
x = df.rolling(5).apply(lambda x: x.autocorr(), raw=False).dropna()
y = []
for i in range(15):
    y.append(np.corrcoef(df['A'][i:i+5], df['A'][i+1:i+6])[0, 1])
    # Other attempts:
    # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),df['A'][(1+i):(6+i)]-df['A'][(1+i):(6+i)].mean(),'valid')[0]
    # np.correlate(df['A'][i:i+5]-df['A'][i:i+5].mean(),np.flip(df['A'][(1+i):(6+i)])-df['A'][(1+i):(6+i)].mean(),'valid')[0]
The pd.Series.autocorr() result is quite different from that of np.corrcoef() (I tried np.correlate() as well).
Is there any way to get the same result as pd.Series.autocorr() using only NumPy functions?
----------------- Example result added ----------------
df['A'] = [0.5314742325906894, 0.7424912257400176, 0.2895649008872213, 0.16967710120380175, 0.5157732179121193, 0.8733423106397956, 0.585705172096987, 0.1387299202733231, 0.18540514459343538, 0.13913104211564564, 0.736937228263526, 0.20944078980434988, 0.2826810751427198, 0.15055686873748197, 0.4159491505728884, 0.07600226975854041, 0.15279939462562298, 0.1405723553409276, 0.8372449734938123, 0.3314986851097367]
x = [0.010637545587524432, 0.03594106077726333, 0.40104877005219836, -0.009106549297130558, 0.4008385963492408, 0.7794761931857483, -0.4182779136016351, -0.2962696925038811, -0.4083361773384266, -0.5244693987698964, -0.5063605533618415, -0.9496936641021706, -0.5303040575891907, -0.42881675192105184, -0.3371366910961831, -0.036231529863559424]
y = [0.11823200733266746, 0.16166841984627847, 0.2033980627120384, 0.2861039403548347, 0.5239653859040245, 0.1602079943122044, -0.3920837265006942, -0.28176746883177917, -0.3604612671108854, -0.5347077109231272, -0.4702461092101919, -0.5287673078857449, -0.4501452367448014, -0.3538574959825232, -0.10013342594129321]
According to the documentation for pandas.Series.autocorr, the default lag is 1 when you call the function without arguments,
which means the series is correlated with a copy of itself shifted by one element.
For example:
a = np.array([0.25, 0.5, 0.2, -0.05])
s = pd.Series(a)
s.autocorr()
gives you:
0.1035526330902407
With np.corrcoef you need to slice the array into two arrays shifted by one element:
np.corrcoef(a[:-1], a[1:])[0, 1]
Which gives you the same result:
0.1035526330902407
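The slicing idea extends to other lags. Here is a minimal sketch, assuming the input has no NaN values (pandas skips NaN pairs, which this version does not handle) and that lag is at least 1; autocorr_np is a hypothetical helper name, not a NumPy or pandas function:

```python
import numpy as np

def autocorr_np(a, lag=1):
    """Lag-`lag` Pearson autocorrelation using only NumPy (hypothetical helper).

    Assumes `a` contains no NaN values and that 1 <= lag < len(a) - 1.
    """
    a = np.asarray(a, dtype=float)
    # Correlate the array without its last `lag` elements
    # against the array without its first `lag` elements.
    return np.corrcoef(a[:-lag], a[lag:])[0, 1]

a = np.array([0.25, 0.5, 0.2, -0.05])
print(autocorr_np(a))  # same computation as np.corrcoef(a[:-1], a[1:])[0, 1]
```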
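Your loop correlates two full 5-element slices, and the second one reaches one element past the rolling window. autocorr instead compares the first W - lag elements of each window with its last W - lag elements. So in your case the code should look like this: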
W = 5  # window size
nrows = len(df) - W + 1  # number of elements after rolling
lag = 1
y = []
for i in range(nrows):
    # first W - lag elements of the window vs. its last W - lag elements
    y.append(np.corrcoef(df['A'][i:i+W-lag], df['A'][i+lag:i+W])[0, 1])
You will get the same result as x.
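If you also want to avoid the Python-level loop, the same calculation can be written over all windows at once. This is only a sketch under a few assumptions: it needs NumPy 1.20 or later for sliding_window_view, it assumes no NaN values and lag >= 1, and rolling_autocorr is a hypothetical name chosen for illustration:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rolling_autocorr(values, window, lag=1):
    """Rolling lag-`lag` Pearson autocorrelation, NumPy only (hypothetical helper).

    Assumes no NaN values and 1 <= lag < window - 1.
    """
    values = np.asarray(values, dtype=float)
    # Shape (len(values) - window + 1, window): one row per rolling window.
    windows = sliding_window_view(values, window)
    left = windows[:, :-lag]   # each window without its last `lag` elements
    right = windows[:, lag:]   # each window without its first `lag` elements
    # Centre each half on its own mean, then form the Pearson correlation row by row.
    left = left - left.mean(axis=1, keepdims=True)
    right = right - right.mean(axis=1, keepdims=True)
    cov = (left * right).sum(axis=1)
    norm = np.sqrt((left ** 2).sum(axis=1) * (right ** 2).sum(axis=1))
    return cov / norm

rng = np.random.default_rng(0)
sample = rng.random(20)
y_vec = rolling_autocorr(sample, window=5, lag=1)
print(y_vec.shape)  # one value per full window
```

For the 20-element column in the question, rolling_autocorr(df['A'].to_numpy(), 5) should line up with the loop above, since each row performs the same slicing as one loop iteration.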