Tags: python, pandas, correlation

How to calculate a correlation with p-values performantly in Python?


I want to compute the correlation matrix of my data together with the corresponding p-values. Currently I am using pandas with its corr method on a DataFrame. The problem is that this correlation method doesn't provide the p-values.

So I tried two answers to this question: Stackoverflow Question. Both solutions use the scipy.stats.pearsonr method for the calculation. I can't use the first one (Solution 1) because it removes too much of my dataset. My next try was the second one (Solution 2). It gets the result I want but takes a huge amount of time.

For comparison: my pandas-only correlation takes about 4 seconds, from creating the DataFrame to calculating the correlation. Solution 2 takes about 6 minutes to return the result. My guess is that repeatedly creating new DataFrames requires heavy computation, so the time adds up across my dataset.

Is there a more performant way to calculate this result? Pandas' corr also has to handle my None values in the background, so there has to be a better solution.

My testing dataset has 500 rows with 550 values each, and as I said, it also contains None values.
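
Purely for illustration, a toy DataFrame of that shape with some missing values (the column names and NaN rate here are made up) could be built like this:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    data = rng.normal(size=(500, 550))
    data[rng.random(data.shape) < 0.1] = np.nan  # sprinkle in some missing values
    df = pd.DataFrame(data, columns=[f"col{i}" for i in range(550)])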


Solution

  • Solving your problem requires both math and programming. Since df.corr returns pretty quickly in your case, I will focus on the calculation of the p-values.

    Programming

    scipy.stats.pearsonr(col_x, col_y) does not like dealing with NaNs, so for each column pair you have to remove all rows where one or both elements are NaN. With 550 columns, that makes 550 * 549 / 2 = 150,975 pairs. You'd better make sure your loop is extremely efficient.

    If you look at its source code, you'll see that DataFrame.corr is so blisteringly fast for two reasons:

    • It's coded in Cython and runs outside the Global Interpreter Lock (GIL). That means the looping happens in bare-metal C.
    • It uses Welford's method and does not rely on scipy.stats. The complexity of this algorithm is O(n * m^2), where n is the number of rows and m is the number of columns. A sketch of the idea follows this list.
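
    For intuition, here is a minimal pure-Python sketch of a Welford-style single-pass correlation for one column pair (an illustration only; it ignores the per-pair NaN handling that the real implementation does):

    import math

    def welford_corr(xs, ys):
        # Running means, sums of squared deviations, and the co-deviation sum
        mean_x = mean_y = m2x = m2y = cov = 0.0
        n = 0
        for x, y in zip(xs, ys):
            n += 1
            dx = x - mean_x               # deviation from the *old* mean
            dy = y - mean_y
            mean_x += dx / n
            mean_y += dy / n
            m2x += dx * (x - mean_x)      # old-mean delta times new-mean delta
            m2y += dy * (y - mean_y)
            cov += dx * (y - mean_y)
        # The 1 / (n - 1) normalizers cancel in the ratio
        return cov / math.sqrt(m2x * m2y)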

    Math

    The pearsonr documentation describes how the p-value is calculated:

    r = <Pearson correlation coefficient>
    dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
    p = 2 * dist.cdf(-abs(r))
    

    Wikipedia gives us the formula for the CDF of the Beta distribution:

    cdf(x, alpha, beta) = B(x, alpha, beta) / B(alpha, beta)
                        = scipy.special.betainc(alpha, beta, x)
    

    Fortunately for us, the betainc function is vectorized: if we pass in three arrays of the same length for the alpha, beta and x arguments, it will return an array as the output.
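
    As a quick sanity check, this betainc formulation reproduces scipy.stats.pearsonr's p-value (an illustrative snippet with made-up data):

    import numpy as np
    import scipy.special
    import scipy.stats

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=100), rng.normal(size=100)

    r, p_ref = scipy.stats.pearsonr(x, y)
    a = len(x) / 2 - 1
    # cdf(-|r|) of beta(a, a, loc=-1, scale=2) equals betainc(a, a, (1 - |r|) / 2)
    p = 2 * scipy.special.betainc(a, a, (1 - abs(r)) / 2)
    print(np.isclose(p, p_ref))  # True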


    Solution 1

    This solution is in native Python and gives reasonable performance on your dataset (500 * 550). It took about 30 seconds on my 2014 iMac with 16GB of RAM:

    import numpy as np
    import pandas as pd
    import scipy.special
    
    def corr1(df):
        mask = df.notna().to_numpy()
        corr = df.corr().to_numpy()
        n_rows, n_cols = df.shape
    
        # Initialize the return arrays for better performance
        length = int(n_cols * (n_cols - 1) / 2)
        idx = np.empty((length, 2), dtype=object)
        correl = np.empty(length, dtype=np.float64)
        count = np.empty(length, dtype=np.uint64)
    
        # For each 2-combination of columns, let `n` be the number of rows in
        # which both elements are non-NaN. We will need that later to calculate
        # the p-value
        k = -1
        for i in range(n_cols):
            for j in range(i):
                n = 0
                for row in range(n_rows):
                    n += 1 if mask[row, i] and mask[row, j] else 0
    
                k += 1
                idx[k] = (i, j)
                correl[k] = corr[i,j]
                count[k] = n
    
        # The p-value can be obtained with the incomplete Beta function (betainc)
        # We just need to massage the inputs a little
        alpha = count / 2 - 1
        x = (correl + 1) / 2                # maps correl from [-1, 1] onto [0, 1]
        x = np.where(correl < 0, x, 1 - x)  # the left-tail CDF is evaluated at -abs(r), i.e. x = (1 - abs(r)) / 2
        p = 2 * scipy.special.betainc(alpha, alpha, x)
        
        return idx, correl, p
    
    # Manipulate the return values into the right format
    index, corr, p = corr1(df)
    
    idx = pd.MultiIndex.from_tuples(
        [(df.columns[i], df.columns[j]) for i, j in index] +
        [(df.columns[j], df.columns[i]) for i, j in index]
    )
    full_index = pd.MultiIndex.from_product([df.columns, df.columns])
    result = pd.DataFrame({
        'corr': np.tile(corr, 2),
        'p': np.tile(p, 2)
    }, index=idx).reindex(full_index).unstack()
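
    Note that result['corr'] and result['p'] are now square 550 x 550 frames; the diagonal entries are NaN because the loop never generates self-pairs (i, i).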
    

    Solution 2

    For the absolute fastest solution, you would have to write it in Cython. This brings the execution time down from 30 seconds to 5. I am sure further optimizations are possible, but I'm too lazy to explore them. The trade-off is a more complex build and deployment process.

    First, make sure you have a C compiler. Then install the Cython package:

    pip install cython
    

    Next, make 2 files: setup.py and utility.pyx:

    #################################################
    # setup.py
    #################################################
    from distutils.core import setup
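    # NOTE: distutils was removed in Python 3.12; on newer versions,
    # `from setuptools import setup` is a drop-in replacement here.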
    from Cython.Build import cythonize
    import numpy
    
    compiler_directives = {
        'language_level': '3'
    }
    setup(
        ext_modules=cythonize("utility.pyx", compiler_directives=compiler_directives),
        include_dirs=[numpy.get_include()]
    )
    
    #################################################
    # utility.pyx
    #################################################
    import cython
    from cython import Py_ssize_t
    
    import numpy as np
    from numpy cimport (
        ndarray,
        float64_t,
        uint8_t,
        uint64_t,
    )
    
    import scipy.special
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def corr2(object df):
        # These variables go into the `nogil` context (i.e. into C land) so they
        # must be statically typed
        cdef:
            Py_ssize_t n_rows, n_cols, i, j, row, n, k
            ndarray[uint8_t, ndim=2] mask
            ndarray[float64_t, ndim=2] corr
            #
            ndarray[uint64_t, ndim=2] idx
            ndarray[float64_t, ndim=1] correl
            ndarray[uint64_t, ndim=1] count
    
        # We are still in Python land and thus have full access to all functions in
        # numpy and pandas. Converting pandas dataframes to a 2D numpy array
        # gives a huge speed boost
        mask = df.notna().to_numpy(dtype='uint8')
        corr = df.corr().to_numpy()
        n_rows, n_cols = df.shape
    
        # Initialize the return arrays
        length = int(n_cols * (n_cols - 1) / 2)
        idx = np.empty((length, 2), dtype=np.uint64)
        correl = np.empty(length, dtype=np.float64)
        count = np.empty(length, dtype=np.uint64)
    
        with nogil:
            # It's C-land in here. Everything must be statically defined
    
            k = -1
            # For each 2-combination of columns, let `n` be the number of rows
            # in which both elements are non-NaN. We will need that later to
            # calculate the p-value
            for i in range(n_cols):
                for j in range(i):
                    n = 0
                    for row in range(n_rows):
                        n += 1 if mask[row, i] and mask[row, j] else 0
                
                    k += 1
                    idx[k, 0] = i
                    idx[k, 1] = j
                    correl[k] = corr[i,j]
                    count[k] = n
        
        # Back to Python-land
        # The p-value can be obtained with the incomplete Beta function (betainc)
        # We just need to massage the input a little
        alpha = count / 2 - 1
        x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
        x = np.where(correl < 0, x, 1 - x)  # the left-tail CDF is evaluated at -abs(r), i.e. x = (1 - abs(r)) / 2
        p = 2 * scipy.special.betainc(alpha, alpha, x)
        return idx, correl, p
    

    Next, compile utility.pyx into machine code:

    python setup.py build_ext --inplace
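
    This produces a platform-specific extension module (a .so file on Linux/macOS, a .pyd on Windows) next to the source file.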
    

    Then you can use utility just like any other Python module:

    import numpy as np
    import pandas as pd

    from utility import corr2
    
    index, corr, p = corr2(df)
    
    idx = pd.MultiIndex.from_tuples(
        [(df.columns[i], df.columns[j]) for i, j in index] +
        [(df.columns[j], df.columns[i]) for i, j in index]
    )
    full_index = pd.MultiIndex.from_product([df.columns, df.columns])
    result = pd.DataFrame({
        'corr': np.tile(corr, 2),
        'p': np.tile(p, 2)
    }, index=idx).reindex(full_index).unstack()