Search code examples
pythoncorrelationscipy.stats

Improve speed of SciPy Pearson correlation for many pairwise calculations?


I have a data frame with 1222 rows and 33,000 columns. I need to compute the pairwise correlation coefficients (and associated p-values) between the first 16,000 columns and the remaining columns. Currently, I am using scipy.stats.pearsonr like:

from scipy.stats import pearsonr


def correlation_analysis(lncRNA_PC_T):
    """Function for correlation analysis"""
    correlations = pd.DataFrame()
    for PC in [column for column in lncRNA_PC_T.columns if "_PC" in column]:
        for lncRNA in [
            column for column in lncRNA_PC_T.columns if "_lncRNAs" in column
        ]:
            correlations = correlations.append(
                pd.Series(
                    pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA]),
                    index=["PCC", "p-value"],
                    name=PC + "_" + lncRNA,
                )
            )

    return correlations

The above code is doing its job; however, for my large data frame, it is taking more than 30 minutes to finish. How can I speed this up?


Solution

  • We can speed this up by over three order of magnitude on the CPU and a few additional orders of magnitude with a GPU.

    I generated some random data to use and structured it like your DataFrame. The size of the DataFrame is reduced so that your code runs in reasonable time.

    import numpy as np
    import pandas as pd
    
    rng = np.random.default_rng(4582345683546)
    
    m = 120
    n = 330
    nx = 160
    ny = n - nx
    
    data = rng.random((m, n))
    column_names = ([f'_PC{i}' for i in range(nx)] 
                    + [f'_lncRNAs{i}' for i in range(ny)])
    lncRNA_PC_T = pd.DataFrame(data, columns=column_names)
    

    The original correlation_analysis takes ~75 seconds to run on a regular Colab instance. The NumPy code below takes about 22 milliseconds, and the test at the end confirms that the resulting statistic and p-value arrays are the same as the columns of your DataFrame.

    import numpy as xp
    from scipy.special import betainc
    
    # import cupy as xp
    # from cupyx.scipy.special import betainc
    
    def pearsonr2(x, y):
        # Assumes inputs are DataFrames and computation is to be performed
        # pairwise between columns. We convert to arrays and reshape so calculation
        # is performed according to normal broadcasting rules along the last axis.
        x = xp.asarray(x).T[:, xp.newaxis, :]
        y = xp.asarray(y).T
        n = x.shape[-1]
    
        # Compute Pearson correlation coefficient. We can't use `cov` or `corrcoef`
        # because they want to compute everything pairwise between rows of a
        # stacked x and y.
        xm = x.mean(axis=-1, keepdims=True)
        ym = y.mean(axis=-1, keepdims=True)
        cov = xp.sum((x - xm) * (y - ym), axis=-1)/(n-1)
        sx = xp.std(x, ddof=1, axis=-1)
        sy = xp.std(y, ddof=1, axis=-1)
        rho = cov/(sx * sy)
    
        # Compute the two-sided p-values. See documentation of scipy.stats.pearsonr.
        ab = n/2 - 1
        x = (abs(rho) + 1)/2
        p = 2*(1-betainc(ab, ab, x))
        return rho, p
    
    df_corr = correlation_analysis(lncRNA_PC_T)
    %timeit correlation_analysis(lncRNA_PC_T)
    # 1min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    x = lncRNA_PC_T.iloc[:, :nx]
    y = lncRNA_PC_T.iloc[:, nx:]
    corr2, p2 = pearsonr2(x, y)
    %timeit pearsonr2(x, y)
    # 21.9 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # Check results
    np.testing.assert_allclose(corr2.ravel(), df_corr.iloc[:, 0])
    np.testing.assert_allclose(p2.ravel(), df_corr.iloc[:, 1])
    

    By commenting out the NumPy/SciPy imports and uncommenting the CuPy imports, you can speed things up even more.