Ordinary least squares with glmnet and lm

This question was asked in stackoverflow.com/q/38378118 but there was no satisfactory answer.

LASSO with λ = 0 is equivalent to ordinary least squares, but this does not seem to be the case for glmnet() and lm() in R. Why?

library(glmnet)
options(scipen = 999)

X = model.matrix(mpg ~ 0 + ., data = mtcars)
y = as.matrix(mtcars["mpg"])
coef(glmnet(X, y, lambda = 0))
lm(y ~ X)

Their regression coefficients agree by at most 2 significant figures, perhaps due to slightly different termination conditions of their optimization algorithms:

                  glmnet        lm
(Intercept)  12.19850081  12.30337
cyl          -0.09882217  -0.11144
disp          0.01307841   0.01334
hp           -0.02142912  -0.02148
drat          0.79812453   0.78711
wt           -3.68926778  -3.71530
qsec          0.81769993   0.82104
vs            0.32109677   0.31776
am            2.51824708   2.52023
gear          0.66755681   0.65541
carb         -0.21040602  -0.19942

The difference is much worse when we add interaction terms.

X = model.matrix(mpg ~ 0 + . + . * disp, data = mtcars)
y = as.matrix(mtcars["mpg"])
coef(glmnet(X, y, lambda = 0))
lm(y ~ X)

Regression coefficients:

                     glmnet           lm
(Intercept)   36.2518682237  139.9814651
cyl          -11.9551206007  -26.0246050
disp          -0.2871942149   -0.9463428
hp            -0.1974440651   -0.2620506
drat          -4.0209186383  -10.2504428
wt             1.3612184380    5.4853015
qsec           2.3549189212    1.7690334
vs           -25.7384282290  -47.5193122
am           -31.2845893123  -47.4801206
gear          21.1818220135   27.3869365
carb           4.3160891408    7.3669904
cyl:disp       0.0980253873    0.1907523
disp:hp        0.0006066105    0.0006556
disp:drat      0.0040336452    0.0321768
disp:wt       -0.0074546428   -0.0228644
disp:qsec     -0.0077317305   -0.0023756
disp:vs        0.2033046078    0.3636240
disp:am        0.2474491353    0.3762699
disp:gear     -0.1361486900   -0.1963693
disp:carb     -0.0156863933   -0.0188304

Solution

If you check out these two posts, you will get a sense as to why you are not getting the same results.

In essence, glmnet penalized maximum likelihood using a regularization path to estimate the model. lm solves the least squares problem using QR decomposition. So the estimates will never be exactly the same.

However, note in the manual for ?glmnet under "lambda":

WARNING: use with care. Do not supply a single value for lambda (for predictions after CV use predict() instead). Supply instead a decreasing sequence of lambda values. glmnet relies on its warms starts for speed, and its often faster to fit a whole path than compute a single fit.

You can do (at least) three things to get the coefficients closer so the difference is trivial--(1) have a range of values for lambda, (2) decrease the threshold value thres, and (3) increase the max number of iterations.

library(glmnet)
options(scipen = 999)

X = model.matrix(mpg ~ 0 + ., data = mtcars)
y = as.matrix(mtcars["mpg"])
lfit <- glmnet(X, y, lambda = rev(0:99), thres = 1E-10)
lmfit <- lm(y ~ X)
coef(lfit, s = 0) - coef(lmfit)
11 x 1 Matrix of class "dgeMatrix"
                          1
(Intercept)  0.004293053125
cyl         -0.000361655351
disp        -0.000002631747
hp           0.000006447138
drat        -0.000065394578
wt           0.000180943607
qsec        -0.000079480187
vs          -0.000462099248
am          -0.000248796353
gear        -0.000222035415
carb        -0.000071164178

X = model.matrix(mpg ~ 0 + . + . * disp, data = mtcars)
y = as.matrix(mtcars["mpg"])
lfit <- glmnet(X, y, lambda = rev(0:99), thres = 1E-12, maxit = 10^7)
lmfit <- glm(y ~ X)
coef(lfit, s = 0) - coef(lmfit)
 20 x 1 Matrix of class "dgeMatrix"
                           1
(Intercept) -0.3174019115228
cyl          0.0414909318817
disp         0.0020032493403
hp           0.0001834076765
drat         0.0188376047769
wt          -0.0120601219002
qsec         0.0019991131315
vs           0.0636756040430
am           0.0439343002375
gear        -0.0161102501755
carb        -0.0088921918062
cyl:disp    -0.0002714213271
disp:hp     -0.0000001211365
disp:drat   -0.0000859742667
disp:wt      0.0000462418947
disp:qsec   -0.0000175276420
disp:vs     -0.0004657059892
disp:am     -0.0003517289096
disp:gear    0.0001629963377
disp:carb    0.0000085312911

Some of the differences for the interacted model are probably non-trivial, but closer.