I have a data frame with 1222 rows and 33,000 columns. I need to compute the pairwise Pearson correlation coefficients (and associated p-values) between the first 16,000 columns and the remaining columns. Currently, I am using scipy.stats.pearsonr like this:
import pandas as pd
from scipy.stats import pearsonr

def correlation_analysis(lncRNA_PC_T):
    """Function for correlation analysis"""
    correlations = pd.DataFrame()
    for PC in [column for column in lncRNA_PC_T.columns if "_PC" in column]:
        for lncRNA in [
            column for column in lncRNA_PC_T.columns if "_lncRNAs" in column
        ]:
            correlations = correlations.append(
                pd.Series(
                    pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA]),
                    index=["PCC", "p-value"],
                    name=PC + "_" + lncRNA,
                )
            )
    return correlations
The above code is doing its job; however, for my large data frame, it is taking more than 30 minutes to finish. How can I speed this up?
We can speed this up by over three orders of magnitude on the CPU, and by a few additional orders of magnitude with a GPU.
I generated some random data structured like your DataFrame, at a reduced size so that your original code runs in a reasonable amount of time.
import numpy as np
import pandas as pd
rng = np.random.default_rng(4582345683546)
m = 120
n = 330
nx = 160
ny = n - nx
data = rng.random((m, n))
column_names = ([f'_PC{i}' for i in range(nx)]
                + [f'_lncRNAs{i}' for i in range(ny)])
lncRNA_PC_T = pd.DataFrame(data, columns=column_names)
The original correlation_analysis takes ~75 seconds to run on a regular Colab instance. The NumPy code below takes about 22 milliseconds, and the test at the end confirms that the resulting statistic and p-value arrays match the columns of the DataFrame your function returns.
import numpy as xp
from scipy.special import betainc
# import cupy as xp
# from cupyx.scipy.special import betainc

def pearsonr2(x, y):
    # Assumes inputs are DataFrames and computation is to be performed
    # pairwise between columns. We convert to arrays and reshape so the
    # calculation is performed according to normal broadcasting rules
    # along the last axis.
    x = xp.asarray(x).T[:, xp.newaxis, :]
    y = xp.asarray(y).T
    n = x.shape[-1]

    # Compute the Pearson correlation coefficient. We can't use `cov` or
    # `corrcoef` because they want to compute everything pairwise between
    # rows of a stacked x and y.
    xm = x.mean(axis=-1, keepdims=True)
    ym = y.mean(axis=-1, keepdims=True)
    cov = xp.sum((x - xm) * (y - ym), axis=-1) / (n - 1)
    sx = xp.std(x, ddof=1, axis=-1)
    sy = xp.std(y, ddof=1, axis=-1)
    rho = cov / (sx * sy)

    # Compute the two-sided p-values. See the documentation of
    # scipy.stats.pearsonr for the formula.
    ab = n/2 - 1
    x = (abs(rho) + 1) / 2
    p = 2 * (1 - betainc(ab, ab, x))
    return rho, p
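To see why the reshaping works: with the sizes above, x becomes shape (160, 1, 120) and y becomes shape (170, 120), so elementwise operations broadcast to (160, 170, 120), and reducing over the last axis leaves one value per column pair. A minimal shape check:

a = xp.ones((160, 1, 120))  # x after xp.asarray(x).T[:, xp.newaxis, :]
b = xp.ones((170, 120))     # y after xp.asarray(y).T
print((a * b).sum(axis=-1).shape)  # (160, 170): one value per column pair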
df_corr = correlation_analysis(lncRNA_PC_T)
%timeit correlation_analysis(lncRNA_PC_T)
# 1min 14s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
x = lncRNA_PC_T.iloc[:, :nx]
y = lncRNA_PC_T.iloc[:, nx:]
corr2, p2 = pearsonr2(x, y)
%timeit pearsonr2(x, y)
# 21.9 ms ± 190 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Check results
np.testing.assert_allclose(corr2.ravel(), df_corr.iloc[:, 0])
np.testing.assert_allclose(p2.ravel(), df_corr.iloc[:, 1])
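If you need the results in the same labeled format the original function returns, you can rebuild the DataFrame from the two arrays; the assertions above confirm that the raveled (C-order) layout matches the nested-loop order. A sketch, assuming the same column-name conventions as above:

from itertools import product

PC_cols = [c for c in lncRNA_PC_T.columns if "_PC" in c]
lncRNA_cols = [c for c in lncRNA_PC_T.columns if "_lncRNAs" in c]
names = [f"{PC}_{lncRNA}" for PC, lncRNA in product(PC_cols, lncRNA_cols)]
correlations = pd.DataFrame(
    {"PCC": corr2.ravel(), "p-value": p2.ravel()}, index=names
)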
By commenting out the NumPy/SciPy imports and uncommenting the CuPy imports, you can speed things up even more.
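For example, running the imports below in the same notebook rebinds xp and betainc to their CuPy equivalents, so pearsonr2 then executes on the GPU. This is a sketch: cupy.asarray may not accept a DataFrame directly, so the inputs are converted to NumPy arrays first, and cupy.asnumpy copies the results back to host memory.

import cupy as xp
from cupyx.scipy.special import betainc

corr_gpu, p_gpu = pearsonr2(x.to_numpy(), y.to_numpy())  # runs on the GPU
corr2, p2 = xp.asnumpy(corr_gpu), xp.asnumpy(p_gpu)      # copy back to the host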