
How can I compute the Pearson correlation matrix and retain only significant values?


I have a 4-by-3 matrix, X, and wish to form the 3-by-3 Pearson correlation matrix, C, obtained by computing correlations between all 3 possible column combinations of X. However, entries of C that correspond to correlations that aren't statistically significant should be set to zero.

I know how to get pair-wise correlations and significance values using pearsonr in scipy.stats. For example,

import numpy as np
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
pearsonr(X[:, 0], X[:, 1])

returns (0.9915008164289165, 0.00849918357108348): a correlation of about .9915 between the first two columns of X, with a p-value of about .0085.

I could easily get my desired matrix using nested loops:

  1. Pre-populate C as a 3-by-3 matrix of zeros.
  2. Each pass of the nested loop will correspond to two columns of X. The entry of C corresponding to this pair of columns will be set to the pairwise correlation provided the p-value is less than or equal to my threshold, say .01.
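The two steps above can be sketched as follows (using scipy.stats.pearsonr and the .01 threshold from the question):

```python
import numpy as np
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
m = X.shape[1]

C = np.zeros((m, m))                   # step 1: pre-populate with zeros
for i in range(m):                     # step 2: visit every column pair
    for j in range(m):
        r, p = pearsonr(X[:, i], X[:, j])
        if p <= 0.01:                  # keep only significant correlations
            C[i, j] = r
```

This works, but it makes a separate call to pearsonr for every pair of columns, which is the overhead a vectorized approach avoids.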

I'm wondering if there's a simpler way. I know in Pandas, I can create the correlation matrix, C, in basically one line:

import pandas as pd

df = pd.DataFrame(data=X)
C_frame = df.corr(method='pearson') 
C = C_frame.to_numpy()

Is there a way to get the matrix or data frame of p-values, P, without a loop? If so, how could I set each entry of C to zero should the corresponding p-value in P exceed my threshold?


Solution

  • Looking through the docs for pearsonr reveals the formulae used to compute the correlations. It is not too difficult to get the correlations between each column of a matrix using vectorization.

    While you could compute the value of C using pandas, I will show a pure NumPy implementation of the entire process.

    First, compute the r-values:

    X = np.array([[1,  1, -2],
                  [0,  0,  0],
                  [0, .2,  1],
                  [5,  3,  4]])
    n = X.shape[0]
    
    X -= X.mean(axis=0)                 # center each column
    s = (X**2).sum(axis=0)              # per-column sum of squared deviations
    # Broadcasting the columns against each other gives all pairwise
    # dot products of the centered columns at once
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    

    Computing the p-values is made simple by the existence of the beta distribution in scipy (this requires import scipy.stats). Taken directly from the pearsonr docs:

    dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
    p = 2 * dist.cdf(-abs(r))
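As a sanity check, the vectorized r and p can be compared against pearsonr for a single pair of columns. The snippet below repeats the setup so it runs on its own:

```python
import numpy as np
import scipy.stats
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
n = X.shape[0]

Xc = X - X.mean(axis=0)                # centered copy of the columns
s = (Xc**2).sum(axis=0)
r = (Xc[..., None] * Xc[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))

# Reference values from pearsonr for columns 0 and 1
r01, p01 = pearsonr(X[:, 0], X[:, 1])
```

The (0, 1) entries of r and p should agree with r01 and p01 to floating-point precision, since pearsonr documents this same beta-distribution formula for its p-value.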
    

    You can trivially make a mask from p with your threshold, and apply it to r to make C:

    mask = (p <= 0.01)
    C = np.zeros_like(r)
    C[mask] = r[mask]
    

    A better option would probably be to modify your r in-place:

    r[p > 0.01] = 0
    

    In function form:

    def non_trivial_correlation(X, threshold=0.01):
        n = X.shape[0]
        X = X - X.mean(axis=0)  # makes a copy: don't modify the original
        s = (X**2).sum(axis=0)
        r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
        p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))
        r[p > threshold] = 0
        return r
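Putting it all together, here is a self-contained run on the question's matrix (the function body is repeated so the snippet stands alone). With the .01 threshold, only the diagonal and the (0, 1)/(1, 0) pair survive:

```python
import numpy as np
import scipy.stats

def non_trivial_correlation(X, threshold=0.01):
    n = X.shape[0]
    X = X - X.mean(axis=0)  # copy; leaves the caller's array untouched
    s = (X**2).sum(axis=0)
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))
    r[p > threshold] = 0
    return r

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
C = non_trivial_correlation(X)
# C has ones on the diagonal, ~.9915 at (0, 1) and (1, 0), and zeros
# everywhere else, since those correlations aren't significant at .01
```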