I have searched a lot, but there is no explanation of how SciPy calculates the p-value for the correlation coefficient, or of why it is unreliable (as stated by SciPy on the function page) for data sets smaller than 500.
scipy.stats.pearsonr
computes the p-value using the t distribution. (You can check the source code in the file stats.py
on GitHub.) This should definitely be mentioned in the docstring.
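Concretely, the computation can be sketched as a small helper function (the function name is mine, not SciPy's; it just applies the formula t = r·sqrt(df / (1 − r²)) with df = n − 2 and doubles the survival function for a two-sided test):

```python
import numpy as np
from scipy.stats import t as tdist


def pearsonr_pvalue(r, n):
    """Two-sided p-value for a Pearson correlation coefficient r
    computed from a sample of size n, using the t distribution
    with n - 2 degrees of freedom."""
    df = n - 2
    t = r * np.sqrt(df / (1.0 - r**2))
    return 2 * tdist.sf(abs(t), df)
```

The worked example below reproduces exactly this calculation step by step.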
Here's an example. First, import pearsonr
and scipy's implementation of the t distribution:
In [333]: import numpy as np

In [334]: from scipy.stats import pearsonr, t as tdist
Define x
and y
for this example:
In [335]: x = np.array([0, 1, 2, 3, 5, 8, 13])
In [336]: y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])
Compute r
and p
for this data:
In [337]: r, p = pearsonr(x, y)
In [338]: r
Out[338]: 0.9739566302403544
In [339]: p
Out[339]: 0.0002073053505382502
Now compute p
again, by first computing the t statistic and then taking twice the survival function of the t distribution at that value (twice, because the test is two-sided):
In [340]: df = len(x) - 2
In [341]: t = r * np.sqrt(df/(1 - r**2))
In [342]: 2*tdist.sf(t, df) # This is the p value.
Out[342]: 0.0002073053505382502
We get the same p value, as expected.
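As for the reliability caveat: the t-based p-value is derived under a normality assumption, which is harder to justify for small samples. One distribution-free sanity check is a permutation test. With only 7 points there are 7! = 5040 orderings of y, so the full permutation distribution can be enumerated (a sketch using the same data as above; this is my check, not what SciPy does):

```python
from itertools import permutations

import numpy as np
from scipy.stats import pearsonr

x = np.array([0, 1, 2, 3, 5, 8, 13])
y = np.array([1.2, 1.4, 1.6, 1.7, 2.0, 4.1, 6.6])

r_obs, _ = pearsonr(x, y)

# Enumerate all 7! = 5040 permutations of y and count how many give a
# correlation at least as extreme (in absolute value) as the observed one.
count = sum(
    abs(pearsonr(x, np.array(perm))[0]) >= abs(r_obs)
    for perm in permutations(y)
)
p_exact = count / 5040  # exact two-sided permutation p-value
print(p_exact)
```

For larger samples, enumerating all permutations is infeasible and one would sample random permutations instead; newer SciPy versions also expose this via the `method` argument of `pearsonr`.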
I don't know the source of the statement "The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so". If anyone knows a citable reference, it should be added to the pearsonr
docstring.