
How can I compute the Pearson correlation matrix and retain only significant values?


I have a 4-by-3 matrix, X, and wish to form the 3-by-3 Pearson correlation matrix, C, obtained by computing correlations between all 3 possible column combinations of X. However, entries of C that correspond to correlations that aren't statistically significant should be set to zero.

I know how to get pair-wise correlations and significance values using pearsonr in scipy.stats. For example,

import numpy as np
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
pearsonr(X[:, 0], X[:, 1])

returns (0.9915008164289165, 0.00849918357108348): a correlation of about .9915 between the first two columns of X, with a p-value of about .0085.

I could easily get my desired matrix using nested loops:

  1. Pre-populate C as a 3-by-3 matrix of zeros.
  2. Each pass of the nested loop will correspond to two columns of X. The entry of C corresponding to this pair of columns will be set to the pairwise correlation provided the p-value is less than or equal to my threshold, say .01.
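The two steps above can be sketched as follows (using scipy.stats.pearsonr and the .01 threshold from the question):

```python
import numpy as np
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
m = X.shape[1]

C = np.zeros((m, m))                   # step 1: pre-populate with zeros
for i in range(m):                     # step 2: visit every column pair
    for j in range(m):
        r, p = pearsonr(X[:, i], X[:, j])
        if p <= 0.01:                  # keep only significant correlations
            C[i, j] = r
```

This works, but it makes a separate call to pearsonr for every pair of columns, which is the overhead a vectorized approach avoids.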

I'm wondering if there's a simpler way. I know in Pandas, I can create the correlation matrix, C, in basically one line:

import pandas as pd

df = pd.DataFrame(data=X)
C_frame = df.corr(method='pearson') 
C = C_frame.to_numpy()

Is there a way to get the matrix or data frame of p-values, P, without a loop? If so, how could I set each entry of C to zero should the corresponding p-value in P exceed my threshold?


Solution

  • Looking through the docs for pearsonr reveals the formulae used to compute the correlations. It is not too difficult to get the correlations between each column of a matrix using vectorization.

    While you could compute the value of C using pandas, I will show a pure NumPy implementation of the entire process.

    First, compute the r-values:

    X = np.array([[1,  1, -2],
                  [0,  0,  0],
                  [0, .2,  1],
                  [5,  3,  4]])
    n = X.shape[0]
    
    X -= X.mean(axis=0)                 # center each column
    s = (X**2).sum(axis=0)              # per-column sum of squared deviations
    # Broadcasting the columns against each other gives all pairwise
    # dot products of the centered columns at once
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    

    Computing the p-values is made simple by the existence of the beta distribution in scipy (this requires import scipy.stats). Taken directly from the pearsonr docs:

    dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
    p = 2 * dist.cdf(-abs(r))
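As a sanity check, the vectorized r and p can be compared against pearsonr for a single pair of columns. The snippet below repeats the setup so it runs on its own:

```python
import numpy as np
import scipy.stats
from scipy.stats import pearsonr

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
n = X.shape[0]

Xc = X - X.mean(axis=0)                # centered copy of the columns
s = (Xc**2).sum(axis=0)
r = (Xc[..., None] * Xc[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))

# Reference values from pearsonr for columns 0 and 1
r01, p01 = pearsonr(X[:, 0], X[:, 1])
```

The (0, 1) entries of r and p should agree with r01 and p01 to floating-point precision, since pearsonr documents this same beta-distribution formula for its p-value.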
    

    You can trivially make a mask from p with your threshold, and apply it to r to make C:

    mask = (p <= 0.01)
    C = np.zeros_like(r)
    C[mask] = r[mask]
    

    A better option would probably be to modify your r in-place:

    r[p > 0.01] = 0
    

    In function form:

    def non_trivial_correlation(X, threshold=0.01):
        n = X.shape[0]
        X = X - X.mean(axis=0)  # makes a copy: don't modify the original
        s = (X**2).sum(axis=0)
        r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
        p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))
        r[p > threshold] = 0
        return r
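Putting it all together, here is a self-contained run on the question's matrix (the function body is repeated so the snippet stands alone). With the .01 threshold, only the diagonal and the (0, 1)/(1, 0) pair survive:

```python
import numpy as np
import scipy.stats

def non_trivial_correlation(X, threshold=0.01):
    n = X.shape[0]
    X = X - X.mean(axis=0)  # copy; leaves the caller's array untouched
    s = (X**2).sum(axis=0)
    r = (X[..., None] * X[..., None, :]).sum(axis=0) / np.sqrt(s[:, None] * s[None, :])
    p = 2 * scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2).cdf(-np.abs(r))
    r[p > threshold] = 0
    return r

X = np.array([[1, 1, -2], [0, 0, 0], [0, .2, 1], [5, 3, 4]])
C = non_trivial_correlation(X)
# C has ones on the diagonal, ~.9915 at (0, 1) and (1, 0), and zeros
# everywhere else, since those correlations aren't significant at .01
```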