Search code examples
rlinear-regression

inconsistent R-squared values in lm()


I am fitting the same linear model in two different ways, resulting in the same parameter estimates but differing R-squared values. Where does the difference come from? Is this a bug in R? Here is my code:

m1 <- lm(stack.loss ~ ., data = stackloss)
summary(m1)

X <- model.matrix(m1)
y <- stackloss$stack.loss
m2 <- lm(y ~ 0 + X)
summary(m2)

The output for m1 is the following (slightly shortened):

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
Air.Flow      0.7156     0.1349   5.307  5.8e-05 ***
Water.Temp    1.2953     0.3680   3.520  0.00263 ** 
Acid.Conc.   -0.1521     0.1563  -0.973  0.34405    

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09

The output for m2 is has the same estimates for coefficients and residual standard error, but different R-squared values and different F-statistic:

             Estimate Std. Error t value Pr(>|t|)    
X(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
XAir.Flow      0.7156     0.1349   5.307  5.8e-05 ***
XWater.Temp    1.2953     0.3680   3.520  0.00263 ** 
XAcid.Conc.   -0.1521     0.1563  -0.973  0.34405    

Residual standard error: 3.243 on 17 degrees of freedom
Multiple R-squared:  0.979, Adjusted R-squared:  0.9741 
F-statistic: 198.2 on 4 and 17 DF,  p-value: 5.098e-14

Why are the R-squared values different?


Solution

  • This is discussed in this post and also this. Here's a break down of what is happening in the lm() source code. The relevant part:

    r <- z$residuals
    f <- z$fitted.values
    w <- z$weights
    if (is.null(w)) {
            mss <- if (attr(z$terms, "intercept"))
                sum((f - mean(f))^2) else sum(f^2)
            rss <- sum(r^2)
    }
    

    Although you included an intercept, the attributes of the terms are not set to include an intercept, compare:

    attr(m1$terms,"intercept")
    [1] 1
    
    attr(m2$terms,"intercept")
    [1] 0
    

    I do not advise doing this, because you can easily use the formula interface to fit the model, without providing the model matrix yourself. But you can see by changing the attribute, you can get summary.lm to use the correct rss and get the correct r-squared:

    attr(m2$terms,"intercept") = 1
    Call:
    lm(formula = y ~ 0 + X)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -7.2377 -1.7117 -0.4551  2.3614  5.6978 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    X(Intercept) -39.9197    11.8960  -3.356  0.00375 ** 
    XAir.Flow      0.7156     0.1349   5.307  5.8e-05 ***
    XWater.Temp    1.2953     0.3680   3.520  0.00263 ** 
    XAcid.Conc.   -0.1521     0.1563  -0.973  0.34405    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 3.243 on 17 degrees of freedom
    Multiple R-squared:  0.9136,    Adjusted R-squared:  0.8983 
    F-statistic:  59.9 on 3 and 17 DF,  p-value: 3.016e-09