r, regression, linear-regression, lm, least-squares

Solving normal equation gives different coefficients from using `lm`?


I wanted to compute a simple regression using both `lm` and plain matrix algebra. However, the regression coefficients I obtain from matrix algebra are only half of those obtained from `lm`, and I have no clue why.

Here's the code:

boot_example <- data.frame(
  x1 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
  x2 = c(0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L),
  x3 = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L),
  x4 = c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L),
  x5 = c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L),
  x6 = c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L),
  preference_rating = c(9L, 7L, 5L, 6L, 5L, 6L, 5L, 7L, 6L)
  )
dummy_regression <- lm("preference_rating ~ x1+x2+x3+x4+x5+x6", data = boot_example)
dummy_regression

Call:
lm(formula = "preference_rating ~ x1+x2+x3+x4+x5+x6", data = boot_example)

Coefficients:
(Intercept)           x1           x2           x3           x4           x5           x6  
     4.2222       1.0000      -0.3333       1.0000       0.6667       2.3333       1.3333 

###The same by matrix algebra
X <- matrix(c(
c(1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L), #upper var
c(0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L), #upper var
c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), #country var
c(0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), #country var
c(1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L), #price var
c(0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L) #price var
), 
nrow = 9, ncol=6)

Y <- c(9L, 7L, 5L, 6L, 5L, 6L, 5L, 7L, 6L)

#Using standardized (mean=0, std=1) "z" -transformation Z = (X-mean(X))/sd(X) for all predictors
X_std <- apply(X, MARGIN = 2, FUN = function(x){(x-mean(x))/sd(x)})

##If constant shall be computed as well, uncomment next line 
#X_std <- cbind(c(rep(1,9)),X_std)

#using matrix algebra formula
solve(t(X_std) %*% X_std) %*% (t(X_std) %*% Y)

           [,1]
[1,]  0.5000000
[2,] -0.1666667
[3,]  0.5000000
[4,]  0.3333333
[5,]  1.1666667
[6,]  0.6666667

Does anyone see the error in my matrix computation?

Thank you in advance!


Solution

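  • `lm` does not standardize the predictors. Each of your dummy columns has a standard deviation of 0.5, so regressing on the standardized columns multiplies every slope by 0.5, which is why your coefficients are exactly half of those from `lm`. To reproduce the `lm` result with matrix algebra, use the raw `X` and add an intercept column: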

    X1 <- cbind(1, X)  ## include intercept
    
    solve(crossprod(X1), crossprod(X1,Y))
    
    #           [,1]
    #[1,]  4.2222222
    #[2,]  1.0000000
    #[3,] -0.3333333
    #[4,]  1.0000000
    #[5,]  0.6666667
    #[6,]  2.3333333
    #[7,]  1.3333333
    

    Note that we should use `crossprod` here. I won't repeat the reasons; see the "follow-up" part of Ridge regression with glmnet gives different coefficients than what I compute by “textbook definition”?
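
The relationship between the two sets of coefficients can be checked numerically. The following is a minimal sketch, not part of the original answer: it rebuilds `X` and `Y` from the question, computes the slopes on the standardized predictors, and divides them by each column's standard deviation to get back to the scale `lm` uses. The names `sds`, `b_std`, `b_raw` and `b_lm` are introduced here for illustration, and `scale()` is used as a shorthand for the question's `apply()` call.

```r
# Data from the question: six dummy predictors, nine observations.
# matrix() fills column by column, so each row of numbers below is one predictor.
X <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0,
  0, 0, 0, 1, 1, 1, 0, 0, 0,
  1, 0, 0, 1, 0, 0, 1, 0, 0,
  0, 1, 0, 0, 1, 0, 0, 1, 0,
  1, 0, 0, 0, 0, 1, 0, 1, 0,
  0, 1, 0, 1, 0, 0, 0, 0, 1
), nrow = 9, ncol = 6)
Y <- c(9, 7, 5, 6, 5, 6, 5, 7, 6)

# Sample standard deviation of each predictor column.
sds <- apply(X, 2, sd)

# Slopes on the standardized predictors, as computed in the question.
X_std <- scale(X)  # centers each column and divides it by its sd
b_std <- drop(solve(crossprod(X_std), crossprod(X_std, Y)))

# Rescale the slopes back to the original units of X.
b_raw <- b_std / sds

# Reference: slopes from lm on the unstandardized data (intercept dropped).
b_lm <- unname(coef(lm(Y ~ X))[-1])

# Compare the rescaled slopes with the lm slopes.
all.equal(b_raw, b_lm)
```

Because the standardized columns are centered, leaving out the intercept column does not change the slopes; only the division by `sd(x)` does. Dividing `b_std` by `sds` therefore undoes the scaling, and the intercept can then be recovered separately as `mean(Y) - sum(b_raw * colMeans(X))`.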