r machine-learning glmnet lasso-regression

glmnet coefficients differ between versions (2.0.16 vs 3.0.2)

I manage an internal code base that relies heavily upon the glmnet package. Upon upgrading to the newest version (v3.0.2) my unit tests started failing for the coefficients of a Cox model. The previous version of glmnet was v2.0.16 (R 3.5.2). I am now running R v3.6.2.

I have noticed that there is a new relax = argument that appears to use un-regularized fits in the path, and I'd imagine this could cause a slight difference in the fits. However, the default is relax = FALSE, so I doubt that is the issue.

Below is a reprex based on the mtcars dataset, fitting 2 randomly chosen features and renaming two variables to time and status so as to allow fitting of a Cox model. A proper reprex comparison is difficult as it would require different R installations, but this should allow anyone to reproduce the issue.

library(magrittr)
library(dplyr)
library(glmnet)
dat <- mtcars %>%
    select(mpg, disp, status = vs, time = hp) %>%   # select 2 features; assign time & status
    mutate_at(1:2, ~ {
      log10(.x) %>% subtract(mean(.)) %>% divide_by(sd(.))   # center & scale
    }) %>% as.matrix()
glmnet(dat[, 1:2], dat[, 3:4], family = "cox", lambda = 0)$beta   # fit model

The result for v3.0.2 is:

#> 2 x 1 sparse Matrix of class "dgCMatrix"
#>              s0
#> mpg   0.2293535
#> disp -1.8160387

The result for v2.0.16 is:

#> 2 x 1 sparse Matrix of class "dgCMatrix"
#>              s0
#> mpg   0.2154324
#> disp -1.8172714

Have others noticed similar discrepancies? I am somewhat surprised not to have found anyone else bumping into this same issue. Am I going to have to update all my unit tests? :(

Insights and/or explanations greatly appreciated. Thanks in advance.

Solution

Slightly too long for a comment:

I reproduced your results on Ubuntu 16.04 (using devtools::install_version(), see below).
2.0-16 to 3.0-2 spans several releases (and several more internally labeled, unreleased versions). The NEWS file makes several references to coxnet (presumably the internal function called for family="cox"):
- 2.0-20:
  - Fixed a bug in internal function coxnet.deviance to do with input pred, as well as saturated loglike (missing) and weights
  - added a coxgrad function for computing the gradient
- 2.0-19: Fixed a bug in coxnet to do with ties between death set and risk set

I would suggest using

devtools::install_version("glmnet",version=...,lib=<version-specific>)

to install every released version from 2.0-16 to 3.0-2 inclusive, each in a separate library, so that it is easy (via library("glmnet", lib.loc=...)) to load different package versions and bisect to find the specific change. (The intermediate versions were unreleased, so you'll be jumping from 2.0-18 to 3.0.)
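Roughly, that could look like the sketch below. The helper names (lib_for, install_one, fit_one), the directory layout and the script name fit_cox.R are all made up for illustration, and the version list is my assumption about what was released to CRAN, so check it against the archive first. Each fit runs in a fresh Rscript process so that only one glmnet version is ever loaded per session.

```r
# Assumed list of released versions; verify against the CRAN archive.
versions <- c("2.0-16", "2.0-18", "3.0", "3.0-1", "3.0-2")

# One library directory per version, e.g. glmnet-libs/glmnet-2.0-16
# (hypothetical layout).
lib_for <- function(version, root = "glmnet-libs") {
  file.path(root, paste0("glmnet-", version))
}

# Install a single version into its own library.
install_one <- function(version) {
  lib <- lib_for(version)
  dir.create(lib, recursive = TRUE, showWarnings = FALSE)
  devtools::install_version("glmnet", version = version, lib = lib)
}

# Run a script (assumed to hold the reprex and print the coefficients)
# in a fresh R process with the requested glmnet version attached.
fit_one <- function(version, script = "fit_cox.R") {
  lib <- normalizePath(lib_for(version), winslash = "/", mustWork = FALSE)
  code <- sprintf('library("glmnet", lib.loc = "%s"); source("%s", echo = TRUE)',
                  lib, script)
  system2(file.path(R.home("bin"), "Rscript"), c("-e", shQuote(code)),
          stdout = TRUE)
}

# Usage (not run here, since it installs packages):
# invisible(lapply(versions, install_one))
# results <- lapply(setNames(versions, versions), fit_one)
```

With only a handful of versions there is no real need to bisect: running all of them and looking for the first version whose output changes is enough.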

I'm guessing that one of those coxnet bug fixes is (intentionally or as a side effect) responsible for the changes.

If it were in an accessible git repository you could use git bisect with a local copy to automate the process (maybe not worth it for such a small number of changepoints), but it doesn't look like the development tree is available: there's a nice pkgdown website but I don't see any links to a version control system.

If you have a lot of time on your hands you can download all of the archived tarballs and hunt through them for changes ...
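A rough sketch of that tarball hunt, assuming the usual CRAN archive URL layout (a version that is still current lives under src/contrib instead) and that each tarball unpacks to a top-level glmnet directory. The function names are hypothetical; changed_files() only reports which files differ between two unpacked trees, so you still have to read the diffs yourself.

```r
# Assumed CRAN archive URL pattern for an archived glmnet release.
archive_url <- function(version) {
  sprintf("https://cran.r-project.org/src/contrib/Archive/glmnet/glmnet_%s.tar.gz",
          version)
}

# Download and unpack one version; returns the unpacked source directory.
fetch_source <- function(version, dest = "glmnet-src") {
  out <- file.path(dest, version)
  dir.create(out, recursive = TRUE, showWarnings = FALSE)
  tarball <- file.path(out, basename(archive_url(version)))
  download.file(archive_url(version), tarball, mode = "wb")
  untar(tarball, exdir = out)
  file.path(out, "glmnet")
}

# Relative paths of files that were added, removed, or whose contents differ.
changed_files <- function(dir_a, dir_b) {
  sums <- function(d) {
    f <- list.files(d, recursive = TRUE)
    setNames(unname(tools::md5sum(file.path(d, f))), f)
  }
  a <- sums(dir_a)
  b <- sums(dir_b)
  all_files <- union(names(a), names(b))
  differs <- vapply(all_files, function(f) !identical(a[f], b[f]), logical(1))
  sort(all_files[differs])
}

# Usage (not run here, since it downloads from CRAN):
# old <- fetch_source("2.0-18")
# new <- fetch_source("3.0")
# grep("cox", changed_files(old, new), value = TRUE)
```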