I'm trying to calculate the correlation coefficient for two datasets that are not of the same length. The code below works only for equal-length arrays.
import numpy as np
from scipy.stats import pearsonr
a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
# Raises ValueError here, because a and b have different lengths
print(pearsonr(a, b))
In my case the length of the b vector can vary from 50 to 100 data points, while the function I want to match is standard (image of a attached). Are there any other preferred modules for matching such patterns?
A little late to the party, but since this is a top Google result, I'll offer a possible answer to this problem:
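import pandas as pd
from scipy.stats import pearsonr
import numpy as np
a = [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
b = [0.4, 0.2, 0.5]
df = pd.DataFrame(dict(x=a))
# The shorter array is the pattern that each window of x is compared against
CORR_VALS = np.array(b)
def get_correlation(vals):
    # pearsonr returns (coefficient, p-value); keep only the coefficient
    return pearsonr(vals, CORR_VALS)[0]
# Slide a window of len(CORR_VALS) over x and correlate each window with the pattern
df['correlation'] = df['x'].rolling(window=len(CORR_VALS)).apply(get_correlation)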
pandas DataFrames (and Series) have a rolling() method that takes the window length (window) as an argument. The object returned by rolling() has an apply() method that takes a function as an argument. You can, for example, calculate the Pearson correlation coefficient using pearsonr from scipy.stats.
In [2]: df['correlation'].values
Out[2]:
array([        nan,         nan, -0.65465367,  0.94491118, -0.94491118,
        0.98974332, -0.94491118,  0.9923356 , -0.18898224, -0.75592895,
       -0.44673396,  0.1452278 ,  0.78423011,  0.16661846])
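To pick out where the short pattern matches best, the same sliding-window correlation can also be computed with NumPy alone. This is only a sketch: the helper name sliding_pearson is made up for illustration, and it assumes NumPy 1.20 or newer for sliding_window_view. Unlike the rolling() result, it has no leading NaN entries, so index i refers to the window that starts at a[i].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51])
b = np.array([0.4, 0.2, 0.5])

def sliding_pearson(series, pattern):
    """Pearson r between `pattern` and every window of `series` of the same length.

    Hypothetical helper, not part of NumPy or SciPy.
    A constant window has zero variance and yields nan (with a runtime warning).
    """
    windows = sliding_window_view(series, len(pattern))  # shape (n - m + 1, m)
    wc = windows - windows.mean(axis=1, keepdims=True)   # center each window
    pc = pattern - pattern.mean()                        # center the pattern
    num = wc @ pc
    den = np.sqrt((wc ** 2).sum(axis=1) * (pc ** 2).sum())
    return num / den

corr = sliding_pearson(a, b)
best_start = int(np.argmax(corr))  # start index of the best-matching window
print(best_start, corr[best_start])
```

Here argmax picks the most positively correlated window; if an inverted match should count as well, take the argmax of np.abs(corr) instead.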
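Applying the same rolling() approach to the data from the question (a with 8 points and b with 6 points, so the window is 6) gives:
In [1]: df
Out[1]:
     x  correlation
0  0.0          NaN
1  0.4          NaN
2  0.2          NaN
3  0.4          NaN
4  0.2          NaN
5  0.4     0.527932
6  0.2    -0.159167
7  0.5     0.189482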
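If the two arrays are instead meant to cover the same span and differ only in how densely they were sampled, another option is to resample one of them to the length of the other and then correlate them end to end. This is a different technique from the rolling window above, and it is only a rough sketch: the helper resample_to_length is hypothetical, linear interpolation via np.interp is an assumption about what suits the data, and np.corrcoef is used here because it gives the same Pearson coefficient as pearsonr without the p-value.

```python
import numpy as np

def resample_to_length(values, n):
    """Linearly interpolate `values` onto n evenly spaced points.

    Hypothetical helper; assumes the samples are evenly spaced over the same span.
    """
    values = np.asarray(values, dtype=float)
    old_x = np.linspace(0.0, 1.0, num=len(values))  # original sample positions
    new_x = np.linspace(0.0, 1.0, num=n)            # target sample positions
    return np.interp(new_x, old_x, values)

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]

# Stretch b from 6 to 8 points so both arrays have equal length
b_resampled = resample_to_length(b, len(a))

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson r
r = np.corrcoef(a, b_resampled)[0, 1]
print(r)
```

Whether this or the rolling window is appropriate depends on the question being asked: resampling compares overall shape, while the rolling window looks for the short pattern somewhere inside the long series.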