Tags: r, feature-engineering, lasso-regression, reproducible-research, elasticnet

How to systematically replicate the results of running n LASSOs on n data sets in R using enet() with lars()


The code I used to fit n LASSO regressions on n CSV data sets with the enet() function from the elasticnet package is the following:

set.seed(150)
system.time(LASSO <- lapply(datasets, function(J) 
               elasticnet::enet(x = as.matrix(dplyr::select(J, 
                                         starts_with("X"))), 
               y = J$Y, lambda = 0, normalize = FALSE)))

The code to extract the coefficients from those n fits is:

## This stores and prints out the estimates for all of the regression 
## equation specifications selected by LASSO when called.
## predict() needs no new data when type = "coefficients".
LASSO_Coeffs <- lapply(LASSO, 
                       function(i) predict(i, 
                                           s = 0.1, mode = "fraction", 
                                           type = "coefficients")[["coefficients"]]) 

The line of code to isolate and store the names of all the variables with positive coefficient estimates only:

IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0])) 
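
One thing worth checking here: the filter `i > 0` keeps only positively signed estimates, so a predictor that LASSO retained with a negative coefficient would be left out. If every retained predictor should count as selected, `i != 0` may be closer to what is wanted. A minimal sketch of the difference, using a made-up coefficient vector `b` (hypothetical, not taken from the data above):

```r
# Hypothetical coefficient vector standing in for one element of LASSO_Coeffs
b <- c(X1 = 0.8, X2 = 0, X3 = -0.4, X4 = 0)

pos_only <- names(b[b > 0])    # positively signed estimates only: "X1"
nonzero  <- names(b[b != 0])   # every nonzero estimate, either sign: "X1" "X3"

print(pos_only)
print(nonzero)
```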

What I want is the syntax required to replicate this process exactly using the lars() function from the lars package (or some other R function I have not heard of that can estimate a LASSO regression).

p.s. Here is all of the code I used to load/import the n data sets into R and store them in the 'datasets' list just in case this added context is of any use whatsoever:

# these 2 lines together create a simple character list of 
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/Data/0.5-5-1-1 to 0.5-6-10-500"
paths_list <- list.files(path = folderpath, full.names = TRUE, recursive = TRUE)

# strip the folder path and the file extension from each csv file name
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)


# The code below reads the data into the RStudio Workspace from
# each of the n datasets in an iterative manner in such a way 
# that it assigns each of them to the corresponding name of that 
# dataset in the file folder they are stored in.
library(data.table)   # provides fread() and as.data.table()
system.time( datasets <- lapply(paths_list, fread) )

I used fread because I am loading 5, 10, or 15k datasets at a time here; and they all initially load as characters/strings due to a quirk of their construction.
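
The comments on the fread() step describe assigning each data set its file name, but lapply() over an unnamed vector of paths returns an unnamed list. If names are wanted, something along these lines might work. The file paths and the placeholder data frames below are hypothetical stand-ins for `paths_list` and the fread() output:

```r
# Hypothetical file paths standing in for `paths_list`
paths <- c("some_folder/set_1.csv", "some_folder/set_2.csv")
ds_names <- tools::file_path_sans_ext(basename(paths))

# Placeholder list standing in for lapply(paths_list, fread)
loaded <- lapply(paths, function(p) data.frame(Y = 0))

# Attach the cleaned file names; later lapply() calls keep these names
loaded <- setNames(loaded, ds_names)
print(names(loaded))   # "set_1" "set_2"
```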

# rename the columns of every data.table in the list 'datasets'
datasets <- lapply(datasets, function(dataset_i) { 
  colnames(dataset_i) <- c("Y","X1","X2","X3","X4","X5","X6","X7","X8",
                           "X9","X10","X11","X12","X13","X14","X15",
                           "X16","X17","X18","X19","X20","X21","X22", 
                           "X23","X24","X25","X26","X27","X28","X29","X30")
  dataset_i })

Structural_IVs <- lapply(datasets, function(j) {j[1, -1]})
Structural_Variables <- lapply(Structural_IVs, function(i) {names(i)[i == 1]})

datasets <- lapply(datasets, function(i) {i[-1:-3, ]})
datasets <- lapply(datasets, \(X) { lapply(X, as.numeric) })
datasets <- lapply(datasets, function(i) { as.data.table(i) })

Solution

  • Going off of the syntax of the code snippets you used in this post, something like this ought to do the trick:

    set.seed(150)     # to ensure replicability
    LASSO.Lars.fits <- lapply(X = datasets, function(i) 
      lars::lars(x = as.matrix(dplyr::select(i, starts_with("X"))), 
                 y = i$Y, type = "lasso", 
                 normalize = FALSE))   # lars() normalizes by default; the enet() call did not
    

    However, if you have the time, I would also recommend checking whether you can replicate your sets of selected variables using glmnet as well as lars. That way, if lars 'selects' different variables than enet did, you can see which of the two sets is identical to the corresponding set selected by glmnet. Otherwise, you would have to simply assume that the sets selected by enet were either all valid or all invalid.
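
To carry the rest of the question's workflow over to lars, the coefficient extraction can probably reuse the same `s = 0.1, mode = "fraction"` arguments, since predict() on a lars fit accepts them too. The sketch below runs on a small simulated data set (`toy`, an assumption made purely for illustration, not one of the original files) and passes `normalize = FALSE` to mirror the enet() call in the question. If elasticnet is installed, it also prints a comparison against the enet() coefficients:

```r
library(lars)

# Simulated stand-in for one element of `datasets` (illustrative assumption)
set.seed(150)
n <- 100
X <- matrix(rnorm(n * 5), nrow = n, dimnames = list(NULL, paste0("X", 1:5)))
toy <- data.frame(Y = 2 * X[, "X1"] - 1.5 * X[, "X3"] + rnorm(n), X)
toy_list <- list(toy)

# Helper (hypothetical name) that pulls out the X columns as a matrix
x_matrix <- function(d) as.matrix(d[, grep("^X", names(d)), drop = FALSE])

# Fit the LASSO path with lars(), without normalizing the predictors
fits <- lapply(toy_list, function(d)
  lars(x = x_matrix(d), y = d$Y, type = "lasso", normalize = FALSE))

# Coefficients at 10% of the full-path L1 norm, as in the question
coefs <- lapply(fits, function(f)
  predict(f, s = 0.1, mode = "fraction",
          type = "coefficients")[["coefficients"]])

# Names of the variables with nonzero estimates
selected <- lapply(coefs, function(b) names(b)[b != 0])
print(selected)

# Optional cross-check against enet(), only if elasticnet is available
if (require("elasticnet", quietly = TRUE)) {
  enet_fit <- enet(x = x_matrix(toy), y = toy$Y, lambda = 0, normalize = FALSE)
  enet_coef <- predict(enet_fit, s = 0.1, mode = "fraction",
                       type = "coefficients")[["coefficients"]]
  print(all.equal(unname(coefs[[1]]), unname(enet_coef)))
}
```

Swapping `toy_list` for `datasets` should give one set of selected variable names per data set, which can then be compared element by element against `IVs_Selected`.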