Tags: python, numpy, vectorization

Is there a way to vectorize the calculation of correlation coefficients from this Numpy array?


This code computes the Pearson correlation coefficient for all possible pairs of L=45-element vectors taken from a stack of M=102272 such vectors. The result is a symmetric MxM matrix occupying about 40 GB of memory. The memory requirement isn't a problem for my computer, but I estimate from test runs that the ~5 billion passes through the inner loop will take a good 2-3 days to complete. My question: is there a straightforward way to vectorize the inner loop to speed things up significantly?

# L = 45
# M = 102272
# data[M,L]  (type 'float32')   

cmat = np.zeros((M,M))

for i in range(M):
    v1 = data[i,:]
    z1 = (v1-np.average(v1))/np.std(v1) 

    for j in range(i+1):
        v2 = data[j,:]
        z2 = (v2-np.average(v2))/np.std(v2)
        cmat[i,j] = cmat[j,i] = z1.dot(z2)/L

Solution

  • There's already a built-in NumPy function, np.corrcoef, that computes the full correlation matrix. Just use it!

    >>> import numpy as np
    >>> rng = np.random.default_rng(seed=42)
    >>> xarr = rng.random((3, 3))
    >>> xarr
    array([[0.77395605, 0.43887844, 0.85859792],
           [0.69736803, 0.09417735, 0.97562235],
           [0.7611397 , 0.78606431, 0.12811363]])
    >>> R1 = np.corrcoef(xarr)
    >>> R1
    array([[ 1.        ,  0.99256089, -0.68080986],
           [ 0.99256089,  1.        , -0.76492172],
           [-0.68080986, -0.76492172,  1.        ]])
    

    Documentation link
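
    If you'd rather keep the z-score formulation from the question, the same result can be obtained with a single matrix product in place of the double loop: standardize every row once, then multiply. A minimal sketch (array sizes reduced here for illustration; M=1000 is an assumption, not the questioner's 102272):

    ```python
    import numpy as np

    M, L = 1000, 45  # small stand-ins for the question's M=102272, L=45
    rng = np.random.default_rng(0)
    data = rng.random((M, L)).astype(np.float32)

    # Standardize each row exactly once (the question's loop recomputes
    # z2 for every pair), then one M x M matrix product does all the dot
    # products at once.
    z = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
    cmat = (z @ z.T) / L  # symmetric M x M correlation matrix

    # Agrees with the built-in, up to float32 rounding
    assert np.allclose(cmat, np.corrcoef(data), atol=1e-4)
    ```

    One practical note: np.corrcoef promotes to float64 internally, so its MxM result would roughly double the ~40 GB footprint mentioned in the question, whereas the manual z-score product above stays in float32.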