I have searched a lot, but there is no explanation of how SciPy calculates the p-value for the correlation coefficient, or of why it is unreliable (as stated by SciPy on the function page) for data sets smaller than 500.
scipy.stats.pearsonr
computes the p-value using the t distribution. (You can check the source code in the file stats.py
on GitHub.) This should definitely be mentioned in the docstring.
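Concretely, the computation can be sketched as a small helper function (the function name is mine, not SciPy's; it just applies the formula t = r·sqrt(df / (1 − r²)) with df = n − 2 and doubles the survival function for a two-sided test):

```python
import numpy as np
from scipy.stats import t as tdist


def pearsonr_pvalue(r, n):
    """Two-sided p-value for a Pearson correlation coefficient r
    computed from a sample of size n, using the t distribution
    with n - 2 degrees of freedom."""
    df = n - 2
    t = r * np.sqrt(df / (1.0 - r**2))
    return 2 * tdist.sf(abs(t), df)
```

The worked example below reproduces exactly this calculation step by step.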
Here's an example. First, import pearsonr
and scipy's implementation of the t distribution:
In [333]: import numpy as np

In [334]: from scipy.stats import pearsonr, t as tdist
Define x
and y
for this example:
In [335]: x = np.array([0, 1, 2, 3, 5, 8, 13])
In [336]: y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])
Compute r
and p
for this data:
In [337]: r, p = pearsonr(x, y)
In [338]: r
Out[338]: 0.9739566302403544
In [339]: p
Out[339]: 0.0002073053505382502
Now compute p
again, by first computing the t statistic and then taking twice the survival function of the t distribution at that value (twice, because the test is two-sided):
In [340]: df = len(x) - 2
In [341]: t = r * np.sqrt(df/(1 - r**2))
In [342]: 2*tdist.sf(t, df) # This is the p value.
Out[342]: 0.0002073053505382502
We get the same p value, as expected.
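As for the reliability caveat: the t-based p-value is derived under a normality assumption, which is harder to justify for small samples. One distribution-free sanity check is a permutation test. With only 7 points there are 7! = 5040 orderings of y, so the full permutation distribution can be enumerated (a sketch using the same data as above; this is my check, not what SciPy does):

```python
from itertools import permutations

import numpy as np
from scipy.stats import pearsonr

x = np.array([0, 1, 2, 3, 5, 8, 13])
y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])

r_obs, _ = pearsonr(x, y)

# Enumerate all 7! = 5040 permutations of y and count how many give a
# correlation at least as extreme (in absolute value) as the observed one.
count = sum(
    abs(pearsonr(x, np.array(perm))[0]) >= abs(r_obs)
    for perm in permutations(y)
)
p_exact = count / 5040  # exact two-sided permutation p-value
print(p_exact)
```

For larger samples, enumerating all permutations is infeasible and one would sample random permutations instead; newer SciPy versions also expose this via the `method` argument of `pearsonr`.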
I don't know the source of the statement "The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so". If anyone knows a citable reference, it should be added to the pearsonr
docstring.