Search code examples
rsummaryinterceptlm

Why does summary overestimate the R-squared with a "no-intercept" model formula


I wanted to make a simple linear model (lm()) without intercept coefficient so I put -1 in my model formula as in the following example. The problem is that the R-squared return by summary(myModel) seems to be overestimated. lm(), summary() and -1 are among the very classic function/functionality in R. Hence I am a bit surprised and I wonder if this is a bug or if there is any reason for this behaviour.

Here is an example:

x <- rnorm(1000, 3, 1)
mydf <- data.frame(x=x, y=1+x+rnorm(1000, 0, 1))
plot(y ~ x, mydf, xlim=c(-2, 10), ylim=c(-2, 10))

mylm1 <- lm(y ~ x, mydf)
mylm2 <- lm(y ~ x - 1, mydf)

abline(mylm1, col="blue") ; abline(mylm2, col="red")
abline(h=0, lty=2) ; abline(v=0, lty=2)

r2.1 <- 1 - var(residuals(mylm1))/var(mydf$y)
r2.2 <- 1 - var(residuals(mylm2))/var(mydf$y)
r2 <- c(paste0("Intercept - r2: ", format(summary(mylm1)$r.squared, digits=4)),
        paste0("Intercept - manual r2: ", format(r2.1, digits=4)),
        paste0("No intercept - r2: ", format(summary(mylm2)$r.squared, digits=4)),
        paste0("No intercept - manual r2: ", format(r2.2, digits=4)))
legend('bottomright', legend=r2, col=c(4,4,2,2), lty=1, cex=0.6)

enter image description here


Solution

  • Your formula is wrong. Here is what I do in fastLm.R in RcppArmadillo when computing the summary:

    ## cf src/library/stats/R/lm.R and case with no weights and an intercept 
    f <- object$fitted.values    
    r <- object$residuals         
    mss <- if (object$intercept) sum((f - mean(f))^2) else sum(f^2)     
    rss <- sum(r^2)   
    
    r.squared <- mss/(mss + rss) 
    df.int <- if (object$intercept) 1L else 0L    
    
    n <- length(f)     
    rdf <- object$df     
    adj.r.squared <- 1 - (1 - r.squared) * ((n - df.int)/rdf)      
    

    There are two places where you need to keep track of whether there is an intercept or not.