python numpy math pattern-matching pearson-correlation

Numpy/Pandas correlate 2 arrays of different length

I'm trying to calculate the correlation coefficient for 2 datasets that are not the same length. The code below works only for equal-length arrays.

import numpy as np
from scipy.stats import pearsonr

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]

print(pearsonr(a, b))  # fails here because a and b have different lengths

In my case the length of the b vector can vary from 50 to 100 data points, while the function I want to match, a, is standard (image of a attached). Are there any other preferred modules for matching such patterns?

Solution

A little late to the party, but since this is a top Google result, I'll offer a possible answer to this problem:

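import numpy as np
import pandas as pd
from scipy.stats import pearsonr

a = [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
b = [0.4, 0.2, 0.5]

df = pd.DataFrame(dict(x=a))

# The shorter array is the pattern that slides along the longer one.
CORR_VALS = np.array(b)

def get_correlation(vals):
    # Pearson r between the current window and the pattern
    return pearsonr(vals, CORR_VALS)[0]

# Roll over the 'x' column only, so the result is a Series and reruns stay safe.
df['correlation'] = df['x'].rolling(window=len(CORR_VALS)).apply(get_correlation, raw=True)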

Explanation

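pandas DataFrames have a rolling() method that takes the window length (window) as an argument. The object returned by rolling() has an apply() method that takes a function as its argument and calls it on each window. Here that function computes the Pearson correlation coefficient between the window and the shorter array, using pearsonr from scipy.stats. The first len(CORR_VALS) - 1 entries are NaN because those windows are not yet full.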
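For longer inputs, the same sliding-window idea can be sketched in plain NumPy, without calling pearsonr once per window. This is only a sketch under some assumptions: rolling_pearson is a hypothetical helper name, it relies on sliding_window_view (NumPy 1.20 or later), and it returns only the n - m + 1 full windows, so there are no leading NaN values and r[i] lines up with row i + len(b) - 1 of the rolling() result.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def rolling_pearson(series, pattern):
    """Pearson r between `pattern` and every same-length window of `series`."""
    series = np.asarray(series, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    # One row per full window: shape (len(series) - len(pattern) + 1, len(pattern))
    windows = sliding_window_view(series, len(pattern))
    # Centre each window and the pattern, then apply the Pearson formula row-wise.
    w = windows - windows.mean(axis=1, keepdims=True)
    p = pattern - pattern.mean()
    denom = np.sqrt((w ** 2).sum(axis=1) * (p ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        return (w @ p) / denom  # NaN where a window (or the pattern) is constant

a = [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
b = [0.4, 0.2, 0.5]
r = rolling_pearson(a, b)
```

The first two values of r should agree with the first two non-NaN values in the example output (-0.65465367 and 0.94491118).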

Example output

In [2]: df['correlation'].values
Out[2]:
array([        nan,         nan, -0.65465367,  0.94491118, -0.94491118,
        0.98974332, -0.94491118,  0.9923356 , -0.18898224, -0.75592895,
       -0.44673396,  0.1452278 ,  0.78423011,  0.16661846])

With the example data in the question

In [1]: df
Out[1]:
     x  correlation
0  0.0          NaN
1  0.4          NaN
2  0.2          NaN
3  0.4          NaN
4  0.2          NaN
5  0.4     0.527932
6  0.2    -0.159167
7  0.5     0.189482
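If the goal is to find where the pattern matches best, one option is to take the position of the largest correlation. A minimal sketch using the question's data; the names pattern, corr, start, end and best are mine, and picking the maximum (rather than, say, the largest absolute value) is an assumption about what counts as the best match:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
pattern = np.array(b, dtype=float)

# Rolling Pearson r, same approach as above, on a Series instead of a DataFrame
corr = pd.Series(a).rolling(window=len(pattern)).apply(
    lambda w: pearsonr(w, pattern)[0], raw=True
)

end = int(corr.idxmax())          # idxmax skips the leading NaN values
start = end - len(pattern) + 1    # first index of the best-matching window
best = corr.loc[end]
print(f"best match: a[{start}:{end + 1}], r = {best:.6f}")
```

Each correlation is stored at the last index of its window, which is why the start of the window is end - len(pattern) + 1.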