Search code examples
pythonnumpystatisticsscipycorrelation

Correlation coefficients and p values for all pairs of rows of a matrix


I have a matrix data with m rows and n columns. I used to compute the correlation coefficients between all pairs of rows using np.corrcoef:

import numpy as np
data = np.array([[0, 1, -1], [0, -1, 1]])
np.corrcoef(data)

Now I would also like to have a look at the p-values of these coefficients. np.corrcoef doesn't provide these; scipy.stats.pearsonr does. However, scipy.stats.pearsonr does not accept a matrix on input.

Is there a quick way how to compute both the coefficient and the p-value for all pairs of rows (arriving e.g. at two m by m matrices, one with correlation coefficients, the other with corresponding p-values) without having to manually go through all pairs?


Solution

  • I have encountered the same problem today.

    After half an hour of googling, I can't find any code in numpy/scipy library can help me do this.

    So I wrote my own version of corrcoef

    import numpy as np
    from scipy.stats import pearsonr, betai
    
    def corrcoef(matrix):
        r = np.corrcoef(matrix)
        rf = r[np.triu_indices(r.shape[0], 1)]
        df = matrix.shape[1] - 2
        ts = rf * rf * (df / (1 - rf * rf))
        pf = betai(0.5 * df, 0.5, df / (df + ts))
        p = np.zeros(shape=r.shape)
        p[np.triu_indices(p.shape[0], 1)] = pf
        p[np.tril_indices(p.shape[0], -1)] = p.T[np.tril_indices(p.shape[0], -1)]
        p[np.diag_indices(p.shape[0])] = np.ones(p.shape[0])
        return r, p
    
    def corrcoef_loop(matrix):
        rows, cols = matrix.shape[0], matrix.shape[1]
        r = np.ones(shape=(rows, rows))
        p = np.ones(shape=(rows, rows))
        for i in range(rows):
            for j in range(i+1, rows):
                r_, p_ = pearsonr(matrix[i], matrix[j])
                r[i, j] = r[j, i] = r_
                p[i, j] = p[j, i] = p_
        return r, p
    

    The first version use the result of np.corrcoef, and then calculate p-value based on triangle-upper values of corrcoef matrix.

    The second loop version just iterating over rows, do pearsonr manually.

    def test_corrcoef():
        a = np.array([
            [1, 2, 3, 4],
            [1, 3, 1, 4],
            [8, 3, 8, 5],
            [2, 3, 2, 1]])
    
        r1, p1 = corrcoef(a)
        r2, p2 = corrcoef_loop(a)
    
        assert np.allclose(r1, r2)
        assert np.allclose(p1, p2)
    

    The test passed, they are the same.

    def test_timing():
        import time
        a = np.random.randn(100, 2500)
    
        def timing(func, *args, **kwargs):
            t0 = time.time()
            loops = 10
            for _ in range(loops):
                func(*args, **kwargs)
            print('{} takes {} seconds loops={}'.format(
                func.__name__, time.time() - t0, loops))
    
        timing(corrcoef, a)
        timing(corrcoef_loop, a)
    
    
    if __name__ == '__main__':
        test_corrcoef()
        test_timing()
    

    The performance on my Macbook against 100x2500 matrix

    corrcoef takes 0.06608104705810547 seconds loops=10

    corrcoef_loop takes 7.585600137710571 seconds loops=10