
SciKit Learn R-squared is very different from square of Pearson's Correlation R


I have two NumPy arrays, like so:

import numpy as np

a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49, 
              28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])

b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
              28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
              30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
              33.081223, 30.312504])

When I calculate the R-squared using SciKit Learn I get a completely different value than when I calculate Pearson's Correlation and then square the result:

import scipy.stats
import sklearn.metrics

sk_r2 = sklearn.metrics.r2_score(a, b)
print('SciKit R2: {:0.5f}\n'.format(sk_r2))

pearson_r = scipy.stats.pearsonr(a, b)
print('Pearson R: ', pearson_r)
print('Pearson R squared: ', pearson_r[0]**2)

Results in:
SciKit R2: 0.15913

Pearson R: (0.7617075766854164, 9.534162339384296e-05)
Pearson R squared: 0.5801984323799696

I realize that an R-squared value can sometimes be negative for a poorly fitting model (https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative) and therefore the square of Pearson's correlation is not always equal to R-squared. However, I thought that for a positive R-squared value it was always equal to Pearson's correlation squared? How are these R-squared values so different?


Solution

  • The Pearson correlation coefficient r and the R2 coefficient of determination are two completely different statistics.

    You can take a look at https://en.wikipedia.org/wiki/Pearson_correlation_coefficient and https://en.wikipedia.org/wiki/Coefficient_of_determination


    update

    Pearson's r coefficient is a measure of linear correlation between two variables:

    r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}

    where \bar{x} and \bar{y} are the means of the samples.

    The R2 coefficient of determination is a measure of goodness of fit:

    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

    where \hat{y}_i is the predicted value of y_i and \bar{y} is the mean of the sample.

    Thus

    1. they measure different things
    2. r**2 is not equal to R2 because their formulas are completely different
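    To make the difference concrete, here is a small sketch (plain NumPy, reusing the arrays from the question) that computes both statistics directly from their definitions:

```python
import numpy as np

a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49,
              28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])
b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
              28.950092, 29.701189, 29.179174, 30.94298, 26.05434, 31.793175,
              30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
              33.081223, 30.312504])

# Pearson's r from its definition: covariance normalized by the two standard deviations
r = np.sum((a - a.mean()) * (b - b.mean())) / (
    np.sqrt(np.sum((a - a.mean()) ** 2)) * np.sqrt(np.sum((b - b.mean()) ** 2)))

# R2 from its definition: 1 - SS_res / SS_tot, treating b as predictions of a
ss_res = np.sum((a - b) ** 2)
ss_tot = np.sum((a - a.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print('r squared:', r ** 2)   # matches scipy.stats.pearsonr(a, b)[0]**2
print('R2       :', r2)       # matches sklearn.metrics.r2_score(a, b)
```

    Note how SS_res compares the two arrays point by point, while r only cares about how they co-vary around their own means; that is why the two numbers can disagree so strongly.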

    update 2

    r**2 is equal to R2 only when you calculate r between a variable (say y) and its predicted value hat y from a linear model

    Let's make an example using the two arrays you provided

    import numpy as np
    import pandas as pd
    import scipy.stats as sps
    import statsmodels.api as sm
    from sklearn.metrics import r2_score as R2
    import matplotlib.pyplot as plt
    
    a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49, 
                  28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])
    
    b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
                  28.950092, 29.701189, 29.179174, 30.94298 , 26.05434 , 31.793175,
                  30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
                  33.081223, 30.312504])
    
    df = pd.DataFrame({
        'x': a,
        'y': b,
    })
    
    df.plot(x='x', y='y', marker='.', ls='none', legend=False);
    

    (scatter plot of y versus x)

    now we fit a linear regression model

    mod = sm.OLS.from_formula('y ~ x', data=df)
    mod_fit = mod.fit()
    print(mod_fit.summary())
    

    output

                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.580
    Model:                            OLS   Adj. R-squared:                  0.557
    Method:                 Least Squares   F-statistic:                     24.88
    Date:                Mon, 29 Mar 2021   Prob (F-statistic):           9.53e-05
    Time:                        14:12:15   Log-Likelihood:                -36.562
    No. Observations:                  20   AIC:                             77.12
    Df Residuals:                      18   BIC:                             79.12
    Df Model:                           1                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     16.0814      2.689      5.979      0.000      10.431      21.732
    x              0.4157      0.083      4.988      0.000       0.241       0.591
    ==============================================================================
    Omnibus:                        6.882   Durbin-Watson:                   3.001
    Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.363
    Skew:                           0.872   Prob(JB):                        0.113
    Kurtosis:                       4.481   Cond. No.                         245.
    ==============================================================================
    

    and compute both r**2 and R2; we can see that in this case they're equal

    predicted_y = mod_fit.predict(df.x)
    print("R2 :", R2(df.y, predicted_y))
    print("r^2:", sps.pearsonr(df.y, predicted_y)[0]**2)
    

    output

    R2 : 0.5801984323799696
    r^2: 0.5801984323799696
    

    You computed R2(df.x, df.y), which can't equal the values above because it applies a goodness-of-fit measure directly between the independent variable x and the dependent variable y. Here we instead computed both r and R2 between y and the predicted values of y.
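    One more way to see this: Pearson's r is symmetric in its two arguments, while r2_score is not. r2_score(y_true, y_pred) treats its first argument as the observed data and its second as the model's predictions, so swapping them changes the result. A sketch, again with the arrays from the question:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

a = np.array([32.0, 25.97, 26.78, 35.85, 30.17, 29.87, 30.45, 31.93, 30.65, 35.49,
              28.3, 35.24, 35.98, 38.84, 27.97, 26.98, 25.98, 34.53, 40.39, 36.3])
b = np.array([28.778585, 31.164268, 24.690865, 33.523693, 29.272448, 28.39742,
              28.950092, 29.701189, 29.179174, 30.94298, 26.05434, 31.793175,
              30.382706, 32.135723, 28.018875, 25.659306, 27.232124, 28.295502,
              33.081223, 30.312504])

# Pearson's r does not care which array comes first
print(pearsonr(a, b)[0])  # 0.7617...
print(pearsonr(b, a)[0])  # same value

# r2_score does: SS_tot is computed from whichever array is passed as y_true
print(r2_score(a, b))  # 0.1591...
print(r2_score(b, a))  # a different value
```

    If r2_score were the square of a correlation, swapping the arguments couldn't change it; the asymmetry is a reminder that it is a goodness-of-fit measure, not a correlation.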