Correcting the output (strictly in terms of the features/variables selected) of n LASSO Regressions on n datasets using glmnet


Note: This is a follow-up to a previous question I asked here on SO, which received an answer that runs but generates incorrect output. This question incorporates that proposed answer to show that, although it runs, its output is not what I need, and gives examples of what the output should look like.

The code snippets included here can all be found in my GitHub Repository for this project, in one or more of the following three Rscript files:

  • LASSO using glmnet (practice version)
  • LASSO script (practice version)
  • LASSO Regressions

Importantly, if you want to replicate my results exactly on your own system, use the folder called "ten" in the GitHub Repo, which contains just 10 of the datasets.

So, briefly, and leaving out the lines that reorder the list of file paths before import (to ensure the datasets end up in the proper order) along with some other housekeeping commands, here is what I have that works properly:

# packages used below: fread()/as.data.table(), select()/starts_with()/filter(), glmnet()
library(data.table)
library(dplyr)
library(glmnet)

# these 2 lines together create a character vector of the paths of
# all the files in the folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/Data/ten"
paths_list <- list.files(path = folderpath, full.names = TRUE, recursive = TRUE)

# import/load the datasets
datasets <- lapply(paths_list, fread)

Structural_IVs <- lapply(datasets, function(j) {j[1, -1]})
True_Regressors <- lapply(Structural_IVs, function(i) {names(i)[i == 1]})

datasets <- lapply(datasets, function(i) {i[-1:-3, ]})
datasets <- lapply(datasets, \(X) { lapply(X, as.numeric) })
datasets <- lapply(datasets, function(i) { as.data.table(i) })

# fitting the n LASSO Regressions using glmnet
set.seed(11)     # to ensure replicability
system.time(LASSO.fits <- lapply(datasets, function(i) 
           glmnet(x = as.matrix(select(i, starts_with("X"))), 
                  y = i$Y, alpha = 0)))

The output is a list whose elements are of the classes "elnet" and "glmnet":

> class(LASSO.fits)
[1] "list"
> class(LASSO.fits[[1]])
[1] "elnet"  "glmnet"

In the previous question, I asked how to get from here to just the names of the variables/features selected by each LASSO, and the following method was proposed:

L_coefs = LASSO.fits |> 
  Map(f = \(model) coef(model, s = .1))

Variables_Selected <- L_coefs |>
  Map(f = \(matr) matr |> as.matrix() |> 
       as.data.frame() |> filter(s1 != 0) |> rownames())
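
For anyone who wants to check the extraction step in isolation, here is a minimal base R sketch of the same idea. It runs on a hand-made one-column matrix that stands in for as.matrix(coef(model, s = .1)). The coefficient values and the helper name selected_features are made up for illustration. Unlike the method above, it also drops the intercept, so only candidate features are returned.

```r
# Hand-made stand-in for as.matrix(coef(model, s = 0.1)): one column named
# "s1", one row per term. The values are invented for illustration only.
coef_mat <- matrix(
  c(0.42, 0, 1.3, 0, -0.8),
  ncol = 1,
  dimnames = list(c("(Intercept)", "X1", "X2", "X3", "X4"), "s1")
)

# Hypothetical helper: names of the terms with a nonzero coefficient,
# leaving out the intercept so that only candidate features are reported.
selected_features <- function(m) {
  nonzero <- rownames(m)[m[, 1] != 0]
  setdiff(nonzero, "(Intercept)")
}

selected_features(coef_mat)
# returns c("X2", "X4")
```

If this behaves as intended, the same helper could be mapped over L_coefs after converting each element with as.matrix().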

The problem is that, after running the glmnet fits and this extraction step, it looks as if no LASSO performed any selection at all, because each resulting model contains all 30 candidate features plus the intercept:

> head(Variables_Selected, n = 3)
[[1]]
 [1] "(Intercept)" "X1"          "X2"          "X3"          "X4"          "X5"         
 [7] "X6"          "X7"          "X8"          "X9"          "X10"         "X11"        
[13] "X12"         "X13"         "X14"         "X15"         "X16"         "X17"        
[19] "X18"         "X19"         "X20"         "X21"         "X22"         "X23"        
[25] "X24"         "X25"         "X26"         "X27"         "X28"         "X29"        
[31] "X30"        

[[2]]
 [1] "(Intercept)" "X1"          "X2"          "X3"          "X4"          "X5"         
 [7] "X6"          "X7"          "X8"          "X9"          "X10"         "X11"        
[13] "X12"         "X13"         "X14"         "X15"         "X16"         "X17"        
[19] "X18"         "X19"         "X20"         "X21"         "X22"         "X23"        
[25] "X24"         "X25"         "X26"         "X27"         "X28"         "X29"        
[31] "X30"        

[[3]]
 [1] "(Intercept)" "X1"          "X2"          "X3"          "X4"          "X5"         
 [7] "X6"          "X7"          "X8"          "X9"          "X10"         "X11"        
[13] "X12"         "X13"         "X14"         "X15"         "X16"         "X17"        
[19] "X18"         "X19"         "X20"         "X21"         "X22"         "X23"        
[25] "X24"         "X25"         "X26"         "X27"         "X28"         "X29"        
[31] "X30" 

P.S. By contrast, when I print the selections actually made by LASSO for the first 3 datasets using the enet function, as shown in the previous question linked above, I get the following (which is presumably what I ought to get here too):

> head(LASSOs_Selections, n = 3)
[[1]]
[1] "X11" "X16"

[[2]]
[1] "X6"  "X7"  "X20"

[[3]]
[1] "X9"  "X10" "X20" 

Solution

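  • You have set alpha = 0 rather than alpha = 1 in the glmnet call that fits all of your LASSOs. In glmnet, alpha = 0 is the ridge penalty, which shrinks coefficients toward zero but does not set them exactly to zero, so every feature stays in the model. alpha = 1 is the lasso penalty, which can set coefficients exactly to zero and therefore performs selection.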

    So, instead of what you currently have, which looks like this:

    # fitting the n LASSO Regressions using glmnet
    set.seed(11)     # to ensure replicability
    system.time(LASSO.fits <- lapply(datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 0)))
    

    It should instead be altered to this:

    # fitting the n LASSO Regressions using glmnet
    set.seed(11)     # to ensure replicability
    system.time(LASSO.fits <- lapply(datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1)))
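
To see the effect of alpha on data where the answer is known, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration rather than something taken from your datasets: the data-generating process (only X11 and X16 enter y), the helper name selected_at, and the choice s = 0.1.

```r
library(glmnet)

set.seed(11)
n <- 200
p <- 30
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("X", seq_len(p))))
# Only X11 and X16 truly enter the outcome (an arbitrary choice).
y <- 2 * x[, "X11"] - 3 * x[, "X16"] + rnorm(n)

# Hypothetical helper: fit glmnet with the given alpha and return the names
# of the features whose coefficient at lambda = s is nonzero.
selected_at <- function(alpha, s = 0.1) {
  fit <- glmnet(x = x, y = y, alpha = alpha)
  b <- as.matrix(coef(fit, s = s))
  setdiff(rownames(b)[b[, 1] != 0], "(Intercept)")
}

ridge_sel <- selected_at(alpha = 0)  # ridge: shrinks, but does not zero out
lasso_sel <- selected_at(alpha = 1)  # lasso: can set coefficients exactly to 0

length(ridge_sel)
length(lasso_sel)
```

With alpha = 0, I would expect ridge_sel to contain all 30 features. With alpha = 1, lasso_sel should be a shorter list that includes X11 and X16, though how many noise features also survive depends on the seed and on s.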