I have 2 numpy arrays like so:
a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49,
28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])
b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
33.081223, 30.312504])
When I calculate the R-squared using scikit-learn, I get a completely different value than when I calculate Pearson's correlation and then square the result:
import sklearn.metrics
import scipy.stats

sk_r2 = sklearn.metrics.r2_score(a, b)
print('SciKit R2: {:0.5f}\n'.format(sk_r2))
pearson_r = scipy.stats.pearsonr(a, b)
print('Pearson R: ', pearson_r)
print('Pearson R squared: ', pearson_r[0]**2)
Results in:
SciKit R2: 0.15913
Pearson R: (0.7617075766854164, 9.534162339384296e-05)
Pearson R squared: 0.5801984323799696
I realize that an R-squared value can sometimes be negative for a poorly fitting model (https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative) and therefore the square of Pearson's correlation is not always equal to R-squared. However, I thought that for a positive R-squared value it was always equal to Pearson's correlation squared? How are these R-squared values so different?
The Pearson correlation coefficient $r$ and the $R^2$ coefficient of determination are two completely different statistics.
You can take a look at https://en.wikipedia.org/wiki/Pearson_correlation_coefficient and https://en.wikipedia.org/wiki/Coefficient_of_determination
Pearson's $r$ coefficient is a measure of linear correlation between two variables and is

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the samples.
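To see the formula in action, here's a minimal sketch (the helper pearson_r is just for illustration, and a and b are the arrays from the question) that computes $r$ by hand and checks it against scipy.stats.pearsonr:

import numpy as np
import scipy.stats as sps

def pearson_r(x, y):
    # sum of cross-deviations over the product of the
    # root sums of squared deviations from the means
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# a and b as defined in the question
print(pearson_r(a, b))        # 0.76170...
print(sps.pearsonr(a, b)[0])  # same value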
The $R^2$ coefficient of determination is a measure of goodness of fit and is

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\hat{y}_i$ is the predicted value of $y_i$ and $\bar{y}$ is the mean of the sample.
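And a matching sketch for this formula (the helper r_squared is again just for illustration), checked against sklearn.metrics.r2_score:

import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# a and b from the question, with b playing the role of hat y
print(r_squared(a, b))  # 0.15913...
print(r2_score(a, b))   # same value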
Thus $r^2$ is not equal to $R^2$, because their formulas are totally different. $r^2$ equals $R^2$ only in the case that you calculate $r$ between a variable (say $y$) and the predicted variable $\hat{y}$ from a linear model fitted to it.
Let's work through an example using the two arrays you provided:
import numpy as np
import pandas as pd
import scipy.stats as sps
import statsmodels.api as sm
from sklearn.metrics import r2_score as R2
import matplotlib.pyplot as plt
a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49,
28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])
b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
33.081223, 30.312504])
df = pd.DataFrame({
'x': a,
'y': b,
})
df.plot(x='x', y='y', marker='.', ls='none', legend=False);
Now we fit a linear regression model:
mod = sm.OLS.from_formula('y ~ x', data=df)
mod_fit = mod.fit()
print(mod_fit.summary())
output
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.580
Model: OLS Adj. R-squared: 0.557
Method: Least Squares F-statistic: 24.88
Date: Mon, 29 Mar 2021 Prob (F-statistic): 9.53e-05
Time: 14:12:15 Log-Likelihood: -36.562
No. Observations: 20 AIC: 77.12
Df Residuals: 18 BIC: 79.12
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 16.0814 2.689 5.979 0.000 10.431 21.732
x 0.4157 0.083 4.988 0.000 0.241 0.591
==============================================================================
Omnibus: 6.882 Durbin-Watson: 3.001
Prob(Omnibus): 0.032 Jarque-Bera (JB): 4.363
Skew: 0.872 Prob(JB): 0.113
Kurtosis: 4.481 Cond. No. 245.
==============================================================================
Now we compute both $r^2$ and $R^2$, and we can see that in this case they're equal:
predicted_y = mod_fit.predict(df.x)
print("R2 :", R2(df.y, predicted_y))
print("r^2:", sps.pearsonr(df.y, predicted_y)[0]**2)
output
R2 : 0.5801984323799696
r^2: 0.5801984323799696
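This also matches the R-squared: 0.580 reported in the OLS summary above; statsmodels exposes it directly on the fitted results:

print(mod_fit.rsquared)  # 0.5801984... (same value again)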
You did R2(df.x, df.y), which can't be equal to the values we computed, because there you used a measure of goodness of fit between the independent variable $x$ and the dependent variable $y$. We instead used both $r$ and $R^2$ with $y$ and the predicted value of $y$.
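To make the contrast explicit, here's a short check (reusing df, mod_fit, R2 and sps from the snippets above) that reproduces both numbers from your question:

# what the question computed: b treated as a prediction of a
print(R2(df.x, df.y))                    # 0.15913...

# squared Pearson correlation: symmetric, insensitive to scale and offset
print(sps.pearsonr(df.x, df.y)[0] ** 2)  # 0.58020...

# R2 between y and the model's predictions recovers the same value
print(R2(df.y, mod_fit.predict(df.x)))   # 0.58020...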