Tags: python, performance, numpy, matrix, pearson

Pearson correlation on big numpy matrices


I have a 24000 * 316 numpy matrix in which each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would get a 24000 * 24000 numpy matrix of Pearson values. My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works, though still slowly. Is it expected to be this slow (it takes more than a day!), and what might I be able to do about it? If it helps, this is my code... nothing special or hard:

import numpy
import scipy.stats

def SimMat(mat, name):

    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs

    # Similarity matrix, with ones on the diagonal
    SimM = numpy.zeros((mrange, mrange))
    for i in range(mrange):
        SimM[i][i] = 1

    # Fill in the off-diagonal entries pair by pair
    for i in range(mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])

            # Keep only significant (p <= 0.05) correlations of at least 0.5
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0

    numpy.savetxt(name, SimM)

    return SimM, nTRs

Thanks


Solution

  • The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5 GB: a 24000 * 24000 float64 matrix takes 24000 * 24000 * 8 bytes ≈ 4.6 GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:

    import numpy as np
    import h5py
    from scipy.stats import pearsonr

    # Create the dataset
    h5 = h5py.File("data.h5", 'w')
    h5["test"] = np.random.random(size=(24000, 316))
    h5.close()

    # Compute the correlations
    h5 = h5py.File("data.h5", 'r+')
    A  = h5["test"][:]

    N   = A.shape[0]
    out = h5.require_dataset("pearson", shape=(N, N), dtype=float)

    # Each row of correlations is computed in memory, then written
    # straight to the hdf5 dataset so the full matrix never has to
    # be held in RAM.
    for i in range(N):
        out[i] = [pearsonr(A[i], A[j])[0] for j in range(N)]

    h5.close()

    Testing the first 100 rows suggests this will take only about 8 hours on a single core. If you parallelize it, you should see a roughly linear speedup with the number of cores, as in the sketch below.
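
    For example, here is a minimal sketch (not part of the original answer) of one way the row loop could be split across cores with multiprocessing, assuming the same data.h5 layout as above. All hdf5 writes stay in the parent process to avoid concurrent-write problems; the pearson_row helper and the chunksize of 10 are illustrative choices, not anything prescribed by the answer.

    import h5py
    from multiprocessing import Pool
    from scipy.stats import pearsonr

    # Load the time series once at module level so worker processes can
    # see A (inherited directly under a fork start method, re-loaded on
    # import under spawn; either way it is only ~60 MB).
    with h5py.File("data.h5", 'r') as f:
        A = f["test"][:]
    N = A.shape[0]

    def pearson_row(i):
        # Correlate row i against every row; return the index as well so
        # the parent knows where to store the result.
        return i, [pearsonr(A[i], A[j])[0] for j in range(N)]

    if __name__ == "__main__":
        with Pool() as pool, h5py.File("data.h5", 'r+') as h5:
            out = h5.require_dataset("pearson", shape=(N, N), dtype=float)
            # Only the parent process touches the hdf5 file.
            for i, row in pool.imap(pearson_row, range(N), chunksize=10):
                out[i] = row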