
Inconsistent results when trying to reproduce the variables selected by LASSOs run using enet(lambda = 0) with lars(type = "lasso")


The scripts referenced here, and the lines of code quoted from them, can all be found in my GitHub repository for this project, along with mini practice datasets (called "ten", "top 50", and "last 50") so that you can replicate my results.

I am attempting to verify that the variables selected by the n LASSO regressions I have run on n datasets are valid, i.e. reproducible. For my research project, I ran a LASSO on each of 260k synthetic datasets that were randomly generated via Monte Carlo methods, and I calculated several performance metrics for each set of selections. All of these LASSOs were fit with the enet() function from the elasticnet package in R. Now I am attempting to reproduce my results using a different function from a different package (setting the same random seed beforehand, of course). For this task I have chosen the lars() function from the lars package; it was my second choice, because I had already spent a long time trying, and failing, to do this with the glmnet package, which I asked about on Stack Overflow here, with no answers.

This is the code I used to run my original 260k LASSOs; it can be found in the "LASSO Regressions" script, the "LASSO Regressions (practice version)" script, and the "LASSO code, but just the regression part" script.

set.seed(11)
system.time(LASSO_fits <- lapply(X = datasets, function(i) 
               elasticnet::enet(x = as.matrix(dplyr::select(i, 
                                         starts_with("X"))), 
               y = i$Y, lambda = 0, normalize = FALSE)))

Note: the "datasets" object is a list of length n in which each element is a data.table/data.frame. From there, I isolate and save just the candidate variables (there are 30 for each of the n datasets) selected by LASSO using:

## This stores and prints out all of the regression 
## equation specifications selected by LASSO when called.
LASSO_Coeffs <- lapply(LASSO_fits, 
                       function(i) predict(i, 
                                           x = as.matrix(dplyr::select(i, starts_with("X"))), 
                                           s = 0.1, mode = "fraction", 
                                           type = "coefficients")[["coefficients"]])

### Build a new list containing just the independent variables
### which are 'selected' or chosen by LASSO for each individual dataset.
IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0]))
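For readers without the repo data, a hypothetical stand-in for the `datasets` list can be built in base R; the names (`Y`, `X1`..`X30`) match the selectors used above, but the data itself is made up purely for illustration:

```r
# Hypothetical mock of the 'datasets' list: each element is a data.frame
# with a response Y and 30 candidate predictors X1..X30, of which only
# the first 5 truly affect Y.
set.seed(11)
datasets <- lapply(1:3, function(k) {
  X <- matrix(rnorm(50 * 30), nrow = 50,
              dimnames = list(NULL, paste0("X", 1:30)))
  beta <- c(rep(1, 5), rep(0, 25))   # 5 true predictors, 25 noise
  data.frame(Y = as.vector(X %*% beta + rnorm(50)), X)
})
str(datasets[[1]][, 1:4])   # columns Y, X1, X2, X3
```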

After that, the formulae I use to calculate all of the model performance metrics are the same for this version and the replication version, so I will skip them and jump to the corresponding sections of either the "LASSO using Lars (regression part only)" script or the "LASSO using Lars" script, starting with the section that fits the n LASSOs:

set.seed(11)     # to ensure replicability
system.time(LASSO.Lars.fits <- lapply(datasets, function(i) 
  lars(x = as.matrix(select(i, starts_with("X"))), 
         y = i$Y, type = "lasso")))


# This stores and prints out all of the regression 
# equation specifications selected by LASSO when called
LASSO.Lars.Coeffs <- lapply(LASSO.Lars.fits, 
                       function(i) predict(i, 
                                           x = as.matrix(dplyr::select(i, starts_with("X"))), 
                                           s = 0.1, mode = "fraction", 
                                           type = "coefficients")[["coefficients"]])

IVs.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(i) names(i[i > 0]))
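To pinpoint exactly where the two runs disagree (assuming both `IVs_Selected` and `IVs.Selected.by.Lars` from the snippets above are in the workspace), the selections can be compared dataset by dataset, for example:

```r
# Which datasets produced different selections under enet vs. lars?
agree <- mapply(identical, IVs_Selected, IVs.Selected.by.Lars)
which(!agree)

# For any mismatched dataset j, list the variables each method
# picked that the other did not:
# setdiff(IVs_Selected[[j]], IVs.Selected.by.Lars[[j]])
# setdiff(IVs.Selected.by.Lars[[j]], IVs_Selected[[j]])
```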

Running both versions on the "top 50" folder included in the GitHub repo linked above, I got the following performance metrics (starting with enet):

> performance_metrics
  True.Positive.Rate   True.Negative.Rate   False.Positive.Rate
               0.987                   1                      0
  Underspecified.Models.Selected    Correctly.Specified.Models.Selected
                               2                                     48
  Overspecified.Models.Selected   Models.with.at.least.one.Omitted.Variable
                              0                                           2
  Models.with.at.least.one.Extra.Variable
                                        0


And then for lars:

> performance_metrics
  True.Positive.Rate  True.Negative.Rate  False.Positive.Rate  Underspecified.Models.Selected
                   1                  1                    0                                0
  Correctly.Specified.Models.Selected   Overspecified.Models.Selected
                                   50                               0
  Models.with.at.least.one.Omitted.Variable   Models.with.at.least.one.Extra.Variable
                                          0                                         0

Due to an unfortunate artifact of how the datasets are constructed, all of the observations in each of them initially have to be loaded into R's environment as strings. This makes it computationally infeasible to run any of these scripts on more than 5,000, 10,000, or 15,000 of the 260,000 total synthetic datasets at a time. So, I will include two screenshots below showing the slight difference in performance between several sets of 5k or 10k LASSOs run using enet (first screenshot) and replications on the same datasets using lars (second screenshot):

[Screenshot 1: enet performance metrics]

[Screenshot 2: lars performance metrics]


Solution

  • Looking through the functions, I believe there remain 2 arguments that differ between the two implementations. In particular, lars has a different error tolerance: its eps argument defaults to 1e-12 rather than .Machine$double.eps. Additionally, normalize is set to FALSE in your enet call but is not specified in your lars call, where it defaults to TRUE. After a brief correction to the arguments so that the two specifications were identical, I was able to produce the same outputs.
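    The two differing defaults can be checked directly from the function signatures. A quick sketch (assuming the CRAN versions of elasticnet and lars are installed; the printed values are whatever your installed versions define):

    ```r
    # Inspect the default arguments of each fitting function
    formals(elasticnet::enet)[c("normalize", "eps")]
    formals(lars::lars)[c("normalize", "eps")]
    ```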

    Note: in the example below I have used the same "top 50" datasets from your GitHub repo.

    setwd("C:\\top 50")
    
    library(tidyverse)
    
    ls_obj = list.files(pattern="*.csv")
    datasets = map(ls_obj, ~read_csv(., skip = 2L))
    
    set.seed(11)
    system.time(LASSO_fits <- lapply(X = datasets, function(i) 
      elasticnet::enet(x = as.matrix(dplyr::select(i, 
                                                   starts_with("X"))), 
                       y = i$Y, lambda = 0, normalize = FALSE)))
    #>    user  system elapsed 
    #>    0.69    0.13    0.87
    
    LASSO_Coeffs <- lapply(LASSO_fits, 
                           function(i) predict(i, 
                                               x = as.matrix(dplyr::select(i, starts_with("X"))), 
                                               s = 0.1, mode = "fraction", 
                                               type = "coefficients")[["coefficients"]])
    
    IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0]))
    
    set.seed(11)     # to ensure replicability
    system.time(LASSO.Lars.fits <- lapply(datasets, function(i) 
      lars::lars(x = as.matrix(select(i, starts_with("X"))), 
           y = i$Y, type = "lasso", normalize = FALSE, eps = .Machine$double.eps)))
    #>    user  system elapsed 
    #>    0.34    0.00    0.39
    
    
    # This stores and prints out all of the regression 
    # equation specifications selected by LASSO when called
    LASSO.Lars.Coeffs <- lapply(LASSO.Lars.fits, 
                                function(i) predict(i, 
                                                    x = as.matrix(dplyr::select(i, starts_with("X"))), 
                                                    s = 0.1, mode = "fraction", 
                                                    type = "coefficients")[["coefficients"]])
    
    IVs.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(i) names(i[i > 0]))
    
    identical(IVs.Selected.by.Lars, IVs_Selected)
    #> [1] TRUE
    

    Created on 2023-02-03 with reprex v2.0.2