Tags: python, scipy, p-value

How does SciPy calculate the p-value in the pearsonr() function?


I have searched a lot, but I cannot find an explanation of how SciPy calculates the p-value for the correlation coefficient, or why it is unreliable (as stated by SciPy on the function's documentation page) for data sets smaller than about 500.


Solution

  • scipy.stats.pearsonr computes the p-value using the t distribution. (You can check the source code in the file stats.py on GitHub.) This should definitely be mentioned in the docstring.

    Here's an example. First, import NumPy, pearsonr, and SciPy's implementation of the t distribution:

    In [334]: import numpy as np
         ...: from scipy.stats import pearsonr, t as tdist
    

    Define x and y for this example:

    In [335]: x = np.array([0, 1, 2, 3, 5, 8, 13])
    
    In [336]: y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])
    

    Compute r and p for this data:

    In [337]: r, p = pearsonr(x, y)
    
    In [338]: r
    Out[338]: 0.9739566302403544
    
    In [339]: p
    Out[339]: 0.0002073053505382502
    

    Now compute p again: first compute the t statistic, then take twice the survival function of the t distribution at that value:

    In [340]: df = len(x) - 2
    
    In [341]: t = r * np.sqrt(df/(1 - r**2))
    
    In [342]: 2*tdist.sf(t, df)  # Two-sided p-value (t is positive here; in general use abs(t)).
    Out[342]: 0.0002073053505382502
    

    We get the same p-value, as expected.
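    For convenience, the same calculation can be wrapped in a small helper. This is only a sketch of the computation shown above, not SciPy's actual implementation, and the name pearsonr_pvalue is made up for this example; taking the absolute value of t makes the two-sided test work for negative correlations as well:

    In [343]: def pearsonr_pvalue(r, n):
         ...:     """Two-sided p-value for a Pearson correlation r from n samples."""
         ...:     df = n - 2                                 # degrees of freedom
         ...:     t = abs(r) * np.sqrt(df / (1 - r**2))      # abs() handles negative r
         ...:     return 2 * tdist.sf(t, df)                 # twice the upper-tail probability
         ...: 
    
    In [344]: pearsonr_pvalue(r, len(x))
    Out[344]: 0.0002073053505382502
    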

    I don't know the source of the statement "The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so". If anyone knows a citable reference, it should be added to the pearsonr docstring.