I want to compute the correlation of my data together with its p-values. Currently I am using pandas with its `corr` method on a `DataFrame`. The problem is that this correlation method doesn't provide the p-values.
So I tried to use two answers to this question: Stackoverflow Question. Both solutions use the `scipy.stats.pearsonr` method for the calculation. I can't use the first one (Solution 1) because it removes too much of my dataset. My next try was the second one (Solution 2). It gets the result I want, but it takes a huge amount of time.
For comparison: my pandas-only correlation takes about 4 seconds, from creating the `DataFrame` to calculating the correlation. Solution 2 takes about 6 minutes to return the result. My guess is that the repeated creation of new `DataFrame`s requires heavy calculation, so the time adds up over my dataset.
Is there a more performant way to calculate this result? Pandas' `corr` also has to handle my `None` values in the background, so there has to be a better solution.

My testing dataset has 500 rows with 550 values each and, as I said, it also contains `None` values.
Solving your problem requires both math and programming. Since `df.corr` returns pretty quickly in your case, I will focus on the calculation of the p-value.
`scipy.stats.pearsonr(col_x, col_y)` does not like dealing with NaN. So for each column pair, you have to remove all rows where one or both elements are NaN. You have 550 columns, hence 550 * 549 / 2 = 150,975 pairs. You better make sure your loop is extremely efficient.
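For a single pair, that per-pair NaN removal looks like this (a minimal sketch with made-up data):

import numpy as np
import scipy.stats

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, np.nan, 3.1, 4.0, 5.2])

valid = ~np.isnan(x) & ~np.isnan(y)       # keep rows where both values exist
r, p = scipy.stats.pearsonr(x[valid], y[valid])

Doing this 150,975 times in pure Python, building a fresh `DataFrame` for each pair, is likely where Solution 2 spends its 6 minutes.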
If you view its source code, `DataFrame.corr` does it so blisteringly fast for 2 reasons:

1. It is written in Cython and the hot loop runs without the GIL, so the looping is essentially done in bare-metal C.
2. It implements its own calculation instead of calling `scipy.stats` for each column pair. The complexity of this algorithm is O(n * m^2), where `n` is the number of rows and `m` is the number of columns.

The `pearsonr` documentation provides a description of how the p-value is calculated:
r = <Pearson correlation coefficient>
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))
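To convince yourself that this formula matches what `pearsonr` computes, you can check it on a small random sample (a quick sketch; the seed and sizes are arbitrary):

import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
a = rng.normal(size=30)
b = a + rng.normal(size=30)

r, p = scipy.stats.pearsonr(a, b)
n = len(a)
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
print(np.isclose(p, 2 * dist.cdf(-abs(r))))  # True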
Wikipedia gives us the formula for the CDF of the Beta distribution:
cdf(x, alpha, beta) = B(x; alpha, beta) / B(alpha, beta)
                    = scipy.special.betainc(alpha, beta, x)
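One step the docs leave implicit: with loc=-1 and scale=2, the Beta distribution is stretched over [-1, 1], so evaluating its CDF at -|r| is the same as evaluating the standard Beta CDF at ((-|r|) - (-1)) / 2 = (1 - |r|) / 2. In other words:

p = 2 * dist.cdf(-abs(r))
  = 2 * scipy.special.betainc(n/2 - 1, n/2 - 1, (1 - abs(r)) / 2)

This is exactly the transformation applied to `x` in the code below.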
Fortunately for us, the `betainc` function is vectorized: if we pass in three equal-length arrays for `alpha`, `beta` and `x`, it will return an array as the output.
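For instance (a toy sketch; the numbers are made up):

import numpy as np
import scipy.special

alpha = np.array([9.0, 16.5, 24.0])   # n/2 - 1 for n = 20, 35, 50
x = np.array([0.35, 0.45, 0.10])      # (1 - |r|) / 2 for each column pair
p = 2 * scipy.special.betainc(alpha, alpha, x)   # three p-values in one call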
This solution is in native Python and gives a reasonable performance on your dataset (500 * 550). It took about 30 seconds on my 2014 iMac with 16GB of RAM:
import numpy as np
import scipy.special


def corr1(df):
    mask = df.notna().to_numpy()
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Initialize the return arrays for better performance
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=object)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    # For each 2-combination of columns, let `n` be the number of rows whose
    # elements are all non-NaN. We will need that later to calculate the
    # p-value
    k = -1
    for i in range(n_cols):
        for j in range(i):
            n = 0
            for row in range(n_rows):
                n += 1 if mask[row, i] and mask[row, j] else 0
            k += 1
            idx[k] = (i, j)
            correl[k] = corr[i, j]
            count[k] = n

    # The p-value can be obtained with the incomplete Beta function (betainc).
    # We just need to massage the inputs a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
    x = np.where(correl < 0, x, 1 - x)  # reflect positive r into the lower tail: x = (1 - |r|) / 2
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p
import pandas as pd

# Manipulate the return values into the right format
index, corr, p = corr1(df)
idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
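The final `unstack` gives you the two square matrices side by side, so extracting either one is a plain column selection (a usage sketch, continuing from `result` above):

corr_matrix = result['corr']   # 550 x 550 symmetric matrix of correlations
p_matrix = result['p']         # matching matrix of p-values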
For the absolutely fastest solution, you would have to write it in Cython. This brings the execution time down from 30 seconds to about 5. I am sure further optimizations are possible, but I'm too lazy to explore them. The trade-off is a more complex build and deployment process.
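As one example of such an optimization (a sketch only; I have not benchmarked it in context): the pair-counting triple loop can be replaced by a single matrix product on the NaN mask, pushing the O(n * m^2) counting work into BLAS:

import numpy as np

# mask is the same boolean array as in corr1: True where a value is present
m = mask.astype(np.int64)
counts = m.T @ m   # counts[i, j] = rows where columns i and j are both non-NaN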
First, make sure you have a C compiler. Then install the Cython package:
pip install cython
Next, make 2 files: `setup.py` and `utility.pyx`:
#################################################
# setup.py
#################################################
from distutils.core import setup
from Cython.Build import cythonize
import numpy

compiler_directives = {
    'language_level': '3'
}
setup(
    ext_modules=cythonize("utility.pyx", compiler_directives=compiler_directives),
    include_dirs=[numpy.get_include()]
)
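Note that distutils was removed from the standard library in Python 3.12; on newer Pythons the same file works if you take `setup` from setuptools instead (assuming setuptools is installed):

from setuptools import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("utility.pyx", compiler_directives={'language_level': '3'}),
    include_dirs=[numpy.get_include()]
)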
#################################################
# utility.pyx
#################################################
import cython
from cython import Py_ssize_t

import numpy as np
from numpy cimport (
    ndarray,
    float64_t,
    uint8_t,
    uint64_t,
)

import scipy.special


@cython.boundscheck(False)
@cython.wraparound(False)
def corr2(object df):
    # These variables go into the `nogil` context (i.e. into C land) so they
    # must be statically typed
    cdef:
        Py_ssize_t n_rows, n_cols, i, j, row, n, k
        ndarray[uint8_t, ndim=2] mask
        ndarray[float64_t, ndim=2] corr
        ndarray[uint64_t, ndim=2] idx
        ndarray[float64_t, ndim=1] correl
        ndarray[uint64_t, ndim=1] count

    # We are still in Python land and thus have full access to all functions in
    # numpy and pandas. Converting the pandas DataFrame to a 2D numpy array
    # gives a huge speed boost
    mask = df.notna().to_numpy(dtype='uint8')
    corr = df.corr().to_numpy()
    n_rows, n_cols = df.shape

    # Initialize the return arrays
    length = int(n_cols * (n_cols - 1) / 2)
    idx = np.empty((length, 2), dtype=np.uint64)
    correl = np.empty(length, dtype=np.float64)
    count = np.empty(length, dtype=np.uint64)

    with nogil:
        # It's C-land in here. Everything must be statically defined
        k = -1
        # For each 2-combination of columns, let `n` be the number of rows
        # whose elements are all non-NaN. We will need that later to
        # calculate the p-value
        for i in range(n_cols):
            for j in range(i):
                n = 0
                for row in range(n_rows):
                    n += 1 if mask[row, i] and mask[row, j] else 0
                k += 1
                idx[k, 0] = i
                idx[k, 1] = j
                correl[k] = corr[i, j]
                count[k] = n

    # Back to Python-land
    # The p-value can be obtained with the incomplete Beta function (betainc).
    # We just need to massage the inputs a little
    alpha = count / 2 - 1
    x = (correl + 1) / 2                # since -1 <= correl <= 1, this makes 0 <= x <= 1
    x = np.where(correl < 0, x, 1 - x)  # reflect positive r into the lower tail: x = (1 - |r|) / 2
    p = 2 * scipy.special.betainc(alpha, alpha, x)
    return idx, correl, p
Next, compile `utility.pyx` into machine code:
python setup.py build_ext --inplace
Then you can use `utility` just like any other Python module:
from utility import corr2

index, corr, p = corr2(df)
idx = pd.MultiIndex.from_tuples(
    [(df.columns[i], df.columns[j]) for i, j in index] +
    [(df.columns[j], df.columns[i]) for i, j in index]
)
full_index = pd.MultiIndex.from_product([df.columns, df.columns])
result = pd.DataFrame({
    'corr': np.tile(corr, 2),
    'p': np.tile(p, 2)
}, index=idx).reindex(full_index).unstack()
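To sanity-check either implementation, you can spot-check a single pair against `pearsonr` directly (a sketch; `col_a` and `col_b` stand in for two of your real column names):

import numpy as np
import scipy.stats

col_a, col_b = df.columns[0], df.columns[1]
x, y = df[col_a].to_numpy(), df[col_b].to_numpy()
valid = ~np.isnan(x) & ~np.isnan(y)
r, p = scipy.stats.pearsonr(x[valid], y[valid])

print(r, result.loc[col_a, ('corr', col_b)])
print(p, result.loc[col_a, ('p', col_b)])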