Tags: python, numpy, math, pattern-matching, pearson-correlation

Numpy/Pandas correlate 2 arrays of different length


I'm trying to calculate the correlation coefficient for two datasets that are not the same length. The code below works only for equal-length arrays.

import numpy as np
from scipy.stats import pearsonr

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]

# Fails as written: pearsonr requires both inputs to have the same length
print(pearsonr(a, b))

In my case the length of the b vector can vary from 50 to 100 data points, while the function I want to match is a standard one. An image of a is attached. Are there any other preferred modules for matching such patterns?



Solution

  • A little late to the party, but since this is a top Google result, I'll offer a possible answer to this problem:

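    import numpy as np
    import pandas as pd
    from scipy.stats import pearsonr

    # Longer series to scan, and the shorter pattern to look for
    a = [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
    b = [0.4, 0.2, 0.5]

    df = pd.DataFrame(dict(x=a))

    CORR_VALS = np.array(b)

    def get_correlation(vals):
        # pearsonr returns (r, p-value); keep only the coefficient
        return pearsonr(vals, CORR_VALS)[0]

    # Correlate each window of len(b) consecutive values of x with the pattern
    df['correlation'] = df['x'].rolling(window=len(CORR_VALS)).apply(get_correlation)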
    
    

    Explanation

    pandas DataFrames have a rolling() method that takes the window length (window) as an argument. The object returned by rolling() has an apply() method that takes a function as an argument; that function is called on each window in turn. You can use it to calculate, for example, the Pearson correlation coefficient with pearsonr from scipy.stats.

    Example output

    In [2]: df['correlation'].values
    Out[2]:
    array([        nan,         nan, -0.65465367,  0.94491118, -0.94491118,
            0.98974332, -0.94491118,  0.9923356 , -0.18898224, -0.75592895,
           -0.44673396,  0.1452278 ,  0.78423011,  0.16661846])
    


    With the example data in the question

    In [1]: df
    Out[1]:
         x  correlation
    0  0.0          NaN
    1  0.4          NaN
    2  0.2          NaN
    3  0.4          NaN
    4  0.2          NaN
    5  0.4     0.527932
    6  0.2    -0.159167
    7  0.5     0.189482
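
For longer series, the same sliding correlation can probably be done without calling pearsonr once per window. The following is only a sketch, not part of the original answer: the helper name rolling_pearson is made up for illustration, and it assumes NumPy 1.20 or newer (for sliding_window_view). It computes Pearson's r directly from the centred windows and then picks the start position of the best-matching window.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_pearson(series, pattern):
    """Pearson r between `pattern` and every same-length window of `series`.

    Hypothetical helper for illustration; element i of the result is the
    correlation of `pattern` with series[i : i + len(pattern)].
    """
    series = np.asarray(series, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    # One row per window: shape (len(series) - len(pattern) + 1, len(pattern))
    windows = sliding_window_view(series, pattern.size)
    # Centre every window and the pattern on their own means
    w = windows - windows.mean(axis=1, keepdims=True)
    p = pattern - pattern.mean()
    denom = np.sqrt((w ** 2).sum(axis=1) * (p ** 2).sum())
    # A constant window has zero variance, which gives 0/0 -> NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        return (w @ p) / denom


a = [0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
b = [0.4, 0.2, 0.5]

r = rolling_pearson(a, b)
best = int(np.nanargmax(r))  # start index of the best-matching window
print(best, r[best])
```

Note the different alignment: here index i refers to the window that starts at i, whereas the pandas rolling() result is aligned to the last element of each window. For the first example above, the best window starts at index 5, which corresponds to the 0.9923356 at index 7 in the pandas output.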