Tags: python, pandas, correlation

How to calculate a correlation with p-values performantly in Python?


I want to compute the correlation matrix of my data together with the corresponding p-values. Currently I am using pandas with its corr method on a DataFrame. The problem is that this correlation method doesn't provide the p-values.

So I tried two answers to this question: Stackoverflow Question. Both solutions use the scipy.stats.pearsonr method for the calculation. I can't use the first one (Solution 1) because it removes too much of my dataset. My next try was the second one (Solution 2). It gets the result I want but takes a huge amount of time.

For comparison: my pandas-only correlation takes about 4 seconds, from creating the DataFrame to calculating the correlation. Solution 2 takes about 6 minutes to return the result. My guess is that repeatedly creating new DataFrames requires heavy computation, so the time adds up across my dataset.

Is there a more performant way to calculate this result? Pandas' corr also has to handle my None values in the background, so there has to be a better solution.

My testing dataset has 500 rows with 550 values each, and as I said, it also contains None values.
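
Purely for illustration, a toy DataFrame of that shape with some missing values (the column names and NaN rate here are made up) could be built like this:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    data = rng.normal(size=(500, 550))
    data[rng.random(data.shape) < 0.1] = np.nan  # sprinkle in some missing values
    df = pd.DataFrame(data, columns=[f"col{i}" for i in range(550)])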


Solution

  • Solving your problem requires both math and programming. Since df.corr returns pretty quickly in your case, I will focus on the calculation of the p-values.

    Programming

    scipy.stats.pearsonr(col_x, col_y) does not like dealing with NaNs, so for each column pair you have to remove all rows where one or both elements are NaN. With 550 columns, that makes 550 * 549 / 2 = 150,975 pairs. You'd better make sure your loop is extremely efficient.

    If you look at its source code, you'll see that DataFrame.corr is so blisteringly fast for two reasons:

    • It's coded in Cython and runs outside the Global Interpreter Lock (GIL). That means the looping happens in bare-metal C.
    • It uses Welford's method and does not rely on scipy.stats. The complexity of this algorithm is O(n * m^2), where n is the number of rows and m is the number of columns. A sketch of the idea follows this list.
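
    For intuition, here is a minimal pure-Python sketch of a Welford-style single-pass correlation for one column pair (an illustration only; it ignores the per-pair NaN handling that the real implementation does):

    import math

    def welford_corr(xs, ys):
        # Running means, sums of squared deviations, and the co-deviation sum
        mean_x = mean_y = m2x = m2y = cov = 0.0
        n = 0
        for x, y in zip(xs, ys):
            n += 1
            dx = x - mean_x               # deviation from the *old* mean
            dy = y - mean_y
            mean_x += dx / n
            mean_y += dy / n
            m2x += dx * (x - mean_x)      # old-mean delta times new-mean delta
            m2y += dy * (y - mean_y)
            cov += dx * (y - mean_y)
        # The 1 / (n - 1) normalizers cancel in the ratio
        return cov / math.sqrt(m2x * m2y)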

    Math

    The pearsonr documentation describes how the p-value is calculated:

    r = <Pearson correlation coefficient>
    dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
    p = 2 * dist.cdf(-abs(r))
    

    Wikipedia gives us the formula for the CDF of the Beta distribution:

    cdf(x, alpha, beta) = B(x, alpha, beta) / B(alpha, beta)
                        = scipy.special.betainc(alpha, beta, x)
    

    Fortunately for us, the betainc function is vectorized: if we pass in three arrays of the same length for the alpha, beta and x arguments, it will return an array as the output.
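
    As a quick sanity check, this betainc formulation reproduces scipy.stats.pearsonr's p-value (an illustrative snippet with made-up data):

    import numpy as np
    import scipy.special
    import scipy.stats

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=100), rng.normal(size=100)

    r, p_ref = scipy.stats.pearsonr(x, y)
    a = len(x) / 2 - 1
    # cdf(-|r|) of beta(a, a, loc=-1, scale=2) equals betainc(a, a, (1 - |r|) / 2)
    p = 2 * scipy.special.betainc(a, a, (1 - abs(r)) / 2)
    print(np.isclose(p, p_ref))  # True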


    Solution 1

    This solution is in native Python and gives reasonable performance on your dataset (500 * 550). It took about 30 seconds on my 2014 iMac with 16GB of RAM:

    import numpy as np
    import pandas as pd
    import scipy.special
    
    def corr1(df):
        mask = df.notna().to_numpy()
        corr = df.corr().to_numpy()
        n_rows, n_cols = df.shape
    
        # Initialize the return arrays for better performance
        length = int(n_cols * (n_cols - 1) / 2)
        idx = np.empty((length, 2), dtype=object)
        correl = np.empty(length, dtype=np.float64)
        count = np.empty(length, dtype=np.uint64)
    
        # For each 2-combination of columns, let `n` be the number of rows in
        # which both elements are non-NaN. We will need that later to calculate
        # the p-value
        k = -1
        for i in range(n_cols):
            for j in range(i):
                n = 0
                for row in range(n_rows):
                    n += 1 if mask[row, i] and mask[row, j] else 0
    
                k += 1
                idx[k] = (i, j)
                correl[k] = corr[i,j]
                count[k] = n
    
        # The p-value can be obtained with the incomplete Beta function (betainc)
        # We just need to massage the inputs a little
        alpha = count / 2 - 1
        x = (correl + 1) / 2                # maps correl from [-1, 1] onto [0, 1]
        x = np.where(correl < 0, x, 1 - x)  # the left-tail CDF is evaluated at -abs(r), i.e. x = (1 - abs(r)) / 2
        p = 2 * scipy.special.betainc(alpha, alpha, x)
        
        return idx, correl, p
    
    # Manipulate the return values into the right format
    index, corr, p = corr1(df)
    
    idx = pd.MultiIndex.from_tuples(
        [(df.columns[i], df.columns[j]) for i, j in index] +
        [(df.columns[j], df.columns[i]) for i, j in index]
    )
    full_index = pd.MultiIndex.from_product([df.columns, df.columns])
    result = pd.DataFrame({
        'corr': np.tile(corr, 2),
        'p': np.tile(p, 2)
    }, index=idx).reindex(full_index).unstack()
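
    Note that result['corr'] and result['p'] are now square 550 x 550 frames; the diagonal entries are NaN because the loop never generates self-pairs (i, i).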
    

    Solution 2

    For the absolute fastest solution, you would have to write it in Cython. This brings the execution time down from 30 seconds to 5. I am sure further optimizations are possible, but I'm too lazy to explore them. The trade-off is a more complex build and deployment process.

    First, make sure you have a C compiler. Then install the Cython package:

    pip install cython
    

    Next, make 2 files: setup.py and utility.pyx:

    #################################################
    # setup.py
    #################################################
    from distutils.core import setup
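    # NOTE: distutils was removed in Python 3.12; on newer versions,
    # `from setuptools import setup` is a drop-in replacement here.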
    from Cython.Build import cythonize
    import numpy
    
    compiler_directives = {
        'language_level': '3'
    }
    setup(
        ext_modules=cythonize("utility.pyx", compiler_directives=compiler_directives),
        include_dirs=[numpy.get_include()]
    )
    
    #################################################
    # utility.pyx
    #################################################
    import cython
    from cython import Py_ssize_t
    
    import numpy as np
    from numpy cimport (
        ndarray,
        float64_t,
        uint8_t,
        uint64_t,
    )
    
    import scipy.special
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def corr2(object df):
        # These variables go into the `nogil` context (i.e. into C land) so they
        # must be statically typed
        cdef:
            Py_ssize_t n_rows, n_cols, i, j, row, n, k
            ndarray[uint8_t, ndim=2] mask
            ndarray[float64_t, ndim=2] corr
            #
            ndarray[uint64_t, ndim=2] idx
            ndarray[float64_t, ndim=1] correl
            ndarray[uint64_t, ndim=1] count
    
        # We are still in Python land and thus have full access to all functions in
        # numpy and pandas. Converting pandas dataframes to a 2D numpy array
        # gives a huge speed boost
        mask = df.notna().to_numpy(dtype='uint8')
        corr = df.corr().to_numpy()
        n_rows, n_cols = df.shape
    
        # Initialize the return arrays
        length = int(n_cols * (n_cols - 1) / 2)
        idx = np.empty((length, 2), dtype=np.uint64)
        correl = np.empty(length, dtype=np.float64)
        count = np.empty(length, dtype=np.uint64)
    
        with nogil:
            # It's C-land in here. Everything must be statically defined
    
            k = -1
            # For each 2-combination of columns, let `n` be the number of rows
            # in which both elements are non-NaN. We will need that later to
            # calculate the p-value
            for i in range(n_cols):
                for j in range(i):
                    n = 0
                    for row in range(n_rows):
                        n += 1 if mask[row, i] and mask[row, j] else 0
                
                    k += 1
                    idx[k, 0] = i
                    idx[k, 1] = j
                    correl[k] = corr[i,j]
                    count[k] = n
        
        # Back to Python-land
        # The p-value can be obtained with the incomplete Beta function (betainc)
        # We just need to massage the input a little
        alpha = count / 2 - 1
        x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
        x = np.where(correl < 0, x, 1 - x)  # the left-tail CDF is evaluated at -abs(r), i.e. x = (1 - abs(r)) / 2
        p = 2 * scipy.special.betainc(alpha, alpha, x)
        return idx, correl, p
    

    Next, compile utility.pyx into machine code:

    python setup.py build_ext --inplace
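
    This produces a platform-specific extension module (a .so file on Linux/macOS, a .pyd on Windows) next to the source file.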
    

    Then you can use utility just like any other Python module:

    import numpy as np
    import pandas as pd

    from utility import corr2
    
    index, corr, p = corr2(df)
    
    idx = pd.MultiIndex.from_tuples(
        [(df.columns[i], df.columns[j]) for i, j in index] +
        [(df.columns[j], df.columns[i]) for i, j in index]
    )
    full_index = pd.MultiIndex.from_product([df.columns, df.columns])
    result = pd.DataFrame({
        'corr': np.tile(corr, 2),
        'p': np.tile(p, 2)
    }, index=idx).reindex(full_index).unstack()