Search code examples
scipyp-valuehypothesis-testpearson-correlationnumpy-random

How to find if two arrays are from uncorrelated distributions?


I have two arrays - array1 and array2. How confidently can I state that they are generated from uncorrelated distributions? Just checking the correlation values would not suffice; I would require p-values as well.

Thought of using the following method -

import numpy as np
from scipy.stats import pearsonr

array1 = np.array([1, 2, 3, 4, 5])
array2 = np.array([5, 4, 3, 2, 1])

# Calculate the Pearson correlation coefficient and p-value
res = pearsonr(array1, array2, alternative='two-sided')

correlation_coefficient, p_value = res.statistic, res.pvalue

# Print the results
print(f"Pearson Correlation Coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

# Define the confidence level (e.g., 95%)
confidence_level = 0.95

# Make a conclusion based on the p-value
if p_value < (1 - confidence_level):
    print(f"I am {confidence_level * 100}% confident that these two arrays are correlated.")

The output -

Pearson Correlation Coefficient: -1.0
P-value: 0.0
I am 95.0% confident that these two arrays are correlated.

However one problem with this approach is found in the docstring statement -

This function also performs a test of the null hypothesis that the distributions underlying the samples are uncorrelated and normally distributed.

I just want to test the hypothesis that the distributions underlying the samples are uncorrelated without caring about the nature of the distributions - normal or anything else.

Is there any alternative that solves this issue?


Two tentative solutions

Simulate the distribution of uncorrelated arrays

I can generate two random arrays (of same length as array1) and calculate their correlation. By repeating this many times I can get a distribution of the correlation values amongst uncorrelated arrays, without any reference to the underlying distribution type. I can use this distribution to get the p-value of the actual observed correlation between array1 and array2.

Of course the distribution followed by numpy random generators might have an impact. Besides, I want to avoid so much coding if a solution is already present.

Can I use method parameter of scipy.stats.pearsonr?

Am not sure what method parameter of pearsonr does. Particularly when method is set to PermutationMethod. Is that somehow a solution to my query?


Solution

  • Start by calculating the observed value of the statistic:

    import numpy as np
    from scipy import stats
    
    array1 = np.array([1, 2, 3, 4, 5])
    array2 = np.array([5, 4, 3, 2, 1])
    
    # Calculate the observed value of the statistic
    res = stats.pearsonr(array1, array2, alternative='two-sided')
    observed = res.statistic  # observed value of the statistic 
    observed  # -1.0
    

    This seems pretty extreme, but we need to quantify the probability of obtaining such an extreme value due to chance. A permutation test does this without making an assumption about the forms of the distributions from which the data were sampled.

    If the samples are from independent distributions, any value observed in the second array was equally likely to have been paired with any value observed in the first array. Accordingly, the null hypothesis of the permutation test is that all possible pairings were equally likely, and the observed pairings were sampled at random. By computing the statistic for each permutation of the second array w.r.t. the first, we obtain the null distribution of the statistic.

    import itertools
    rng = np.random.default_rng(4259845268735019)
    null_distribution = []
    for resample in itertools.permutations(array2):
        res = stats.pearsonr(array1, resample)
        null_distribution.append(res.statistic)
    null_distribution = np.asarray(null_distribution)
    

    The (one-sided) p-value corresponding with the alternative that distributions are dependent with positive/negative correlation is the fraction of the null distribution greater/less than or equal to the observed statistic value. The two-sided p-value is twice the minimum of the two.

    pvalue_greater = (null_distribution >= observed).sum() / len(null_distribution)
    pvalue_less = (null_distribution <= observed).sum() / len(null_distribution)
    pvalue = 2 * min(pvalue_greater, pvalue_less)
    pvalue  # 0.016666666666666666
    

    The null distribution happens to be symmetric about 0, so you may prefer to think of the two-sided p-value the fraction of statistic values "more extreme" (in either direction) than the observed value of the statistic.

    (abs(null_distribution) >= abs(observed)).sum() / len(null_distribution)
    # 0.016666666666666666
    

    This is essentially what pearsonr does under the hood when you pass method='PermutationMethod'.

    res = stats.pearsonr(array1, array2, alternative='two-sided', method=stats.PermutationMethod())
    res.pvalue  # 0.016666666666666666
    

    As the arrays get larger, the number of permutations grows quickly. By default, 'PermutationMethod' does no more than 9999 resamples, opting for randomized permutations if this is not exhaustive. You can change this using the n_resamples parameter of PermutationMethod.

    Consider avoiding a statement like "I am 95.0% confident that these two arrays are correlated" and instead reporting the full p-value and how you obtained it.