Search code examples
pythonnumpystatisticscoefficient-of-determination

Why do coefficient of determination, R², implementations produce different results?


When attempting to implement a python function for calculating the coefficient of determination, R², I noticed I got wildly different results depending on whose calculation sequence I used.

The wikipedia page on R² gives a seemingly very clear explanation as to how R² should be calculated. My numpy interpretation of what is being said on the wiki page is the following:

def calcR2_wikipedia(y, yhat):
    # Mean value of the observed data y.
    y_mean = np.mean(y)
    # Total sum of squares.
    SS_tot = np.sum((y - y_mean)**2)
    # Residual sum of squares.
    SS_res = np.sum((y - yhat)**2)
    # Coefficient of determination.
    R2 = 1.0 - (SS_res / SS_tot)
    return R2

When I try this method with a target vector y and a vector of modeled estimates yhat, this function produces an R² value of -0.00301.

However, the accepted answer from this stackoverflow post discussing how to calculate R², gives the following definition:

def calcR2_stackOverflow(y, yhat):
    SST = np.sum((y - np.mean(y))**2)
    SSReg = np.sum((yhat - np.mean(y))**2)
    R2 = SSReg/SST
    return R2

Using that method with the same y and yhat vectors as before, I now get an R² of 0.319.

Moreover, in the same stackoverflow post, a lot people of seem to be in favor of calculating the R² with the scipy module like this:

import scipy
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(yhat, y)
R2 = r_value**2

Which in my case produces 0.261.

So my question is: why are R² values produced from seemingly well-accepted sources radically different from each other? And what is the correct way of calculating R² between two vectors?


Solution

  • Definitions

    This is a notation abuse that often leads to misunderstanding. You are comparing two different coefficients:

    • Coefficient of determination (usually noted as R^2) which can be used for any OLS regression not only linear regression (OLS is linear with regards of fit parameters not the function itself);
    • Pearson Correlation Coefficient (usually noted as r or r^2 when squared) which is used for linear regression only.

    If you read carefully the introduction of Coefficient of determination on Wikipedia page, you will see that it is discussed there, it starts as follow:

    There are several definitions of R2 that are only sometimes equivalent.

    MCVE

    You can confirm that classical implementation of those score return expected results:

    import numpy as np
    import scipy
    from sklearn import metrics
    
    np.random.seed(12345)
    x = np.linspace(-3, 3, 1001)
    yh = np.polynomial.polynomial.polyval(x, [1, 2])
    e = np.random.randn(x.size)
    yn = yh + e
    

    Then your function calcR2_wikipedia (0.9265536406736125) returns the coefficient of determination, it can be confirmed as it returns the same as sklearn.metrics.r2_score:

    metrics.r2_score(yn, yh) # 0.9265536406736125
    

    In the other hand the scipy.stats.linregress returns the correlation coefficient (valid for linear regression only):

    slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(yh, yn)
    r_value # 0.9625821384210018
    

    Which you can cross confirm by it's definition:

    C = np.cov(yh, yn)
    C[1,0]/np.sqrt(C[0,0]*C[1,1]) # 0.9625821384210017