Tags: python, pandas, performance, scipy, correlation

How to highly optimize correlation calculations using `pandas` DataFrames and `scipy`


I want to calculate Pearson correlations of the columns of a pandas DataFrame. I don't just want to use DataFrame.corr() because I also need the p-value of each correlation; therefore, I am using scipy.stats.pearsonr(x, y). My problem right now is that my dataframe is huge (shape: (1166, 49262)), so I'm looking at (49262^2 - 49262) / 2 correlations.

Please advise on how I can optimize this to reduce the computation time.

My code for the correlation:

import pandas as pd
from scipy.stats import pearsonr

# the variable `data` contains the dataframe of shape (1166, 49262)

# setting up output dataframes
dfcols = pd.DataFrame(columns=data.columns)
correlation = dfcols.T.join(dfcols, how='outer')
pvalues = correlation.copy()
# pairwise calculation
for r in range(len(data.columns)):
    for c in range(r+1, len(data.columns)):
        # iterate over all combinations of columns to calculate correlation
        tmp = data.iloc[:, [r,c]].dropna()
        if len(tmp) < 2:
            # too few data points to calculate correlation coefficient
            result = (0, 1) 
        else:
            result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1])
        correlation.iloc[r, c] = result[0]
        pvalues.iloc[r, c] = result[1]

Some notes:

  • I am also open to suggestions for packages other than scipy; I just need pvalues for the correlations.
  • I think iterating over the columns by integer indexing instead of column name is faster; can someone confirm/deny this? (A small timing sketch for checking this is included after these notes.)
  • The data frame has a lot of missing data, so I .dropna() and catch cases where less than two data points remain.
  • Simply iterating over the dataframe and extracting the columns in pairs would take over 16.5 days, without even doing any calculations (extrapolated from the first 5 full passes using the following code):
import logging

def foo():
    data = load_df()        # the pd.DataFrame of shape (1166, 49262)
    cols = data.columns
    for i in range(len(cols)):
        logging.info(f"{i+1}/{len(cols)}")
        for j in range(i+1, len(cols)):
            tmp = data.iloc[:, [i, j]].dropna()
            if len(tmp) < 2:
                # You may ignore this for this post; I was looking for columns pairs with too few data points to correlate
                logging.warning(f"correlating columns '{cols[i]}' and '{cols[j]}' results in less than 2 usable data points")

foo()
  • I think I could at least use multithreading to run the correlation calculations on several threads in parallel.
  • In case someone might deem this important: The data I'm working with is a proteomic dataset with ~50,000 peptides and 1166 patients; I want to correlate the expression of the peptides over all patients in a pairwise fashion.
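
A hypothetical micro-benchmark sketch for the integer-vs-label indexing question above (the column names and sizes are made up; timing this on the real data is the only reliable check):

import timeit

import numpy as np
import pandas as pd

# made-up stand-in for the real dataframe
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((1166, 500)),
                    columns=[f"peptide_{i}" for i in range(500)])
cols = data.columns

def by_position():
    # select the column pair by integer position
    return data.iloc[:, [3, 7]].dropna()

def by_label():
    # select the same column pair by name
    return data[[cols[3], cols[7]]].dropna()

print("integer indexing:", timeit.timeit(by_position, number=1000))
print("label indexing:  ", timeit.timeit(by_label, number=1000))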

Solution

  • You can obtain about a 200x speedup by using DataFrame.corr() plus converting the resulting R values into p-values with a beta distribution.

    I would suggest implementing this by looking at how SciPy did it and seeing if there are any improvements which are applicable to your case. The source code can be found here. This tells you exactly how they implemented the p-value. Specifically, they take a beta distribution with a = b = n / 2 - 1, running from -1 to 1, and find either the cumulative distribution function or the survival function of that distribution at the specified R value.

    So while pearsonr() does not support being vectorized across all pairs of columns, the underlying beta distribution does. Using this, you can turn the correlation matrix that DataFrame.corr() gives you into correlations plus p-values.
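
    As a quick sanity check of that formula (a minimal sketch with made-up data, not part of the benchmark code below), the beta distribution's survival function reproduces the two-sided p-value from pearsonr() for a single pair of columns:

    import numpy as np
    from scipy.stats import beta, pearsonr

    rng = np.random.default_rng(0)
    x, y = rng.random(50), rng.random(50)

    r, p_scipy = pearsonr(x, y)
    ab = len(x) / 2 - 1  # a = b = n/2 - 1
    # beta distribution rescaled from [0, 1] to [-1, 1] via loc/scale
    p_beta = 2 * beta.sf(abs(r), a=ab, b=ab, loc=-1, scale=2)

    print(p_scipy, p_beta)  # agree to machine precision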

    I've checked this against your existing algorithm, and it agrees with it to within machine epsilon. I also tested it with NA values.

    In terms of speed, it is roughly 200x faster than your original solution, making it faster than a multicore solution while only using one core.

    Here is the code. Note that only calculate_corr_fast and get_pvalue_vectorized are important to the solution. The rest is just to set up test data or for comparison.

    import pandas as pd
    import numpy as np
    from scipy.stats import pearsonr
    import scipy.spatial.distance
    import scipy.stats
    
    M = 1000
    N = 200
    P = 0.1
    A = np.random.rand(M, N)
    A[np.random.rand(M, N) < P] = np.nan
    df = pd.DataFrame(A, columns=[f"a{i}" for i in range(1, N + 1)])
    
    # setting up output dataframes
    def calculate_corr(data):
        dfcols = pd.DataFrame(columns=data.columns)
        correlation = dfcols.T.join(dfcols, how='outer')
        pvalues = correlation.copy()
        # pairwise calculation
        for r in range(len(data.columns)):
            for c in range(r, len(data.columns)):
                # iterate over all combinations of columns to calculate correlation
                tmp = data.iloc[:, [r,c]].dropna()
                if len(tmp) < 2:
                    # too few data points to calculate correlation coefficient
                    result = (0, 1) 
                else:
                    result = pearsonr(tmp.iloc[:, 0], tmp.iloc[:, 1])
                correlation.iloc[r, c] = result[0]
                correlation.iloc[c, r] = result[0]
                pvalues.iloc[r, c] = result[1]
                pvalues.iloc[c, r] = result[1]
        return correlation, pvalues
    
    
    def get_pvalue_vectorized(r, ab, alternative='two-sided'):
        """Get p-values from the beta distribution, given the correlation matrix r,
        the beta parameters ab (n/2 - 1 for each pair), and the alternative."""
        assert len(r.shape) == 2
        assert len(ab.shape) == 2
        diag = np.arange(r.shape[0])
        # This is just to keep squareform happy. These don't actually
        # get sent to the beta distribution function.
        r[diag, diag] = 0
        ab[diag, diag] = 0
        # Avoid doing repeated computations of r,c and c,r
        rsq = scipy.spatial.distance.squareform(r)
        r[diag, diag] = 1
        absq = scipy.spatial.distance.squareform(ab)
        kwargs = dict(a=absq, b=absq, loc=-1, scale=2)
    
        if alternative == 'less':
            pvalue = scipy.stats.beta.cdf(rsq, **kwargs)
        elif alternative == 'greater':
            pvalue = scipy.stats.beta.sf(rsq, **kwargs)
        elif alternative == 'two-sided':
            pvalue = 2 * (scipy.stats.beta.sf(np.abs(rsq), **kwargs))
        else:
            message = "`alternative` must be 'less', 'greater', or 'two-sided'."
            raise ValueError(message)
        # Put back into 2d matrix
        pvalue = scipy.spatial.distance.squareform(pvalue)
        return pvalue
    
    
    def calculate_corr_fast(data):
        correlation = data.corr()
        # For each pair of data values, count how many cases where both data values are
        # defined at the same position, using matrix multiply as a pairwise boolean and.
        data_notna = data.notna().values.astype('float32')
        value_count = data_notna.T @ data_notna
        # This is the a and b parameter for the beta distribution. It is different
        # for every p-value, because each one can potentially have a different number
        # of missing values
        ab = value_count / 2 - 1
        pvalues = get_pvalue_vectorized(correlation.values, ab)
        invalid = value_count < 2
        pvalues[invalid] = np.nan
        # Convert back to dataframe
        pvalues = pd.DataFrame(pvalues, columns=correlation.columns, index=correlation.index)
        return correlation, pvalues
    
    
    correlation, pvalues = calculate_corr(df)
    correlation_fast, pvalues_fast = calculate_corr_fast(df)
    assert np.allclose(pvalues_fast.values.astype('float64'), pvalues.values.astype('float64'))
    assert np.allclose(correlation_fast.values.astype('float64'), correlation.values.astype('float64'))
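
    The data_notna.T @ data_notna step above computes, for every pair of columns, how many rows are non-NA in both columns at once; the slow version gets the same count implicitly from dropna(). A minimal sketch (made-up 3-column frame, separate from the benchmark code) that checks this against an explicit loop:

    import numpy as np
    import pandas as pd

    toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                        "b": [np.nan, 2.0, 3.0, 4.0],
                        "c": [1.0, 2.0, np.nan, np.nan]})
    notna = toy.notna().values.astype('float32')
    counts = notna.T @ notna  # pairwise count of jointly non-NA rows

    for i in range(toy.shape[1]):
        for j in range(toy.shape[1]):
            expected = len(toy.iloc[:, [i, j]].dropna())
            assert counts[i, j] == expected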
    

    Benchmark results for 1000x200 dataframe:

    Original code
    40.5 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
    @Andrej Kesely's answer
    ~20 seconds
    My answer
    190 ms ± 209 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Benchmark notes:

    • I used a dataframe with 1000 rows and 200 columns. I tried to use about the same number of rows so as to preserve the split of work between finding correlations and calculating p-value. I reduced the number of columns so the benchmark would finish in my lifetime. :)
    • My benchmarking system has 4 cores. Specifically, it is an Intel(R) Core(TM) i7-8850H CPU.