
How to iteratively remove just the intercept terms from the variables selected by n glmnet functions run on n datasets in R


I have run N individual LASSO Regressions on N different data sets using the glmnet() function from the package of the same name in RStudio using the following lines of code:

# Load the packages used below
library(glmnet)   # glmnet()
library(dplyr)    # select(), starts_with(), filter()

# Fit one LASSO regression on each of the n datasets stored in the
# list 'datasets', and store the fitted model objects that glmnet()
# returns in L.fits.
set.seed(11)     # to ensure replicability
system.time(L.fits <- lapply(X = datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1)))

From there, it took me a long time and a fair amount of help from the Stack Overflow community, but I was able to put together some code that isolates and extracts just the names of the candidate variables that glmnet's LASSO selected for each of the N datasets:

# Extract the coefficients of each LASSO fit at lambda = 0.1, then
# keep only the names of the variables with nonzero coefficients
L.coefs = L.fits |> 
  Map(f = \(model) coef(model, s = .1))

Variables.Selected <- L.coefs |>
  Map(f = \(matr) matr |> as.matrix() |> 
       as.data.frame() |> filter(s1 != 0) |> rownames())


>  head(Variables.Selected, n = 4)
[[1]]
 [1] "(Intercept)" "X1"          "X6"          "X7"         
 [5] "X8"          "X10"         "X11"         "X13"        
 [9] "X15"         "X17"         "X20"         "X22"        
[13] "X24"         "X26"         "X27"         "X28"        
[17] "X29"         "X30"        

[[2]]
 [1] "(Intercept)" "X3"          "X5"          "X8"         
 [5] "X9"          "X13"         "X14"         "X16"        
 [9] "X19"         "X20"         "X24"         "X25"        
[13] "X26"         "X29"         "X30"        

[[3]]
 [1] "(Intercept)" "X1"          "X4"          "X5"         
 [5] "X10"         "X12"         "X13"         "X14"        
 [9] "X19"         "X20"         "X21"         "X24"        
[13] "X25"         "X27"         "X29"        

[[4]]
 [1] "(Intercept)" "X3"          "X4"          "X5"         
 [5] "X9"          "X10"         "X11"         "X14"        
 [9] "X17"         "X18"         "X22"         "X24"        
[13] "X26"         "X27"         "X28"

This gives me everything I want, but it also returns the intercept term for every LASSO fit, which will cause big problems later in the script when I measure the performance of these n LASSOs in terms of how many of the regression specifications they select are correctly specified. For example, if the true (structural) regression equation for dataset 4 is Y = X3 + X4 + X5 + X9 + X10 + X11 + X14 + X17 + X18 + X22 + X24 + X26 + X27 + X28, then the model LASSO selected is correctly specified, because every included variable matches, none are missing, and there are no extras.

However, I can only do this scoring, which is the point of the whole exercise, if I can directly compare the output above with an object I have named Structural_Variables, which contains the true variables for each dataset. As long as the intercepts remain where they are, no regression equation selected by glmnet can ever be scored as correctly specified, and my performance metrics will be meaningless.
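To make the problem concrete, here is a small base R sketch using the dataset 4 names from the output above, typed in by hand. The object names `selected_4` and `structural_4` are made up for this illustration and do not exist in my script.

```r
# Dataset 4: names returned by the extraction step (intercept included)
selected_4 <- c("(Intercept)", "X3", "X4", "X5", "X9", "X10", "X11", "X14",
                "X17", "X18", "X22", "X24", "X26", "X27", "X28")

# Dataset 4: the true (structural) regressors
structural_4 <- c("X3", "X4", "X5", "X9", "X10", "X11", "X14",
                  "X17", "X18", "X22", "X24", "X26", "X27", "X28")

# Are the two sets of names the same? The intercept makes this FALSE.
with_intercept <- setequal(selected_4, structural_4)

# Which selected names are not in the structural set?
extra <- setdiff(selected_4, structural_4)

print(with_intercept)
print(extra)
```

Here the only element that keeps the two sets from matching is "(Intercept)", which is why I need it gone from every element of Variables.Selected.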

p.s. Just to clarify, the Structural_Variables object has already been created for each data set at this point, and looks like this:

> head(Structural_Variables, n = 4)
[[1]]
 [1] "X1"  "X6"  "X7"  "X8"  "X10" "X11" "X13" "X17" "X20" "X24"
[11] "X26" "X27" "X28" "X30"

[[2]]
 [1] "X3"  "X5"  "X8"  "X9"  "X13" "X14" "X16" "X19" "X20" "X24"
[11] "X25" "X26" "X29" "X30"

[[3]]
 [1] "X1"  "X4"  "X5"  "X10" "X12" "X13" "X14" "X19" "X20" "X21"
[11] "X24" "X25" "X27" "X29"

[[4]]
 [1] "X3"  "X4"  "X5"  "X9"  "X10" "X11" "X14" "X17" "X18" "X22"
[11] "X24" "X26" "X27" "X28"

Solution

  • Try this, or some slight variation on it, and see how it goes:

    Variables.Selected <- L.coefs |>
      Map(f = \(matr) matr |> as.matrix() |> 
           as.data.frame() |> filter(s1 != 0) |> rownames())
    
    # Drop the first element, "(Intercept)", from each vector of names
    Variables.Selected = lapply(seq_along(datasets), \(no_Ints)
                                Variables.Selected[[no_Ints]][-1])
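Dropping by position assumes "(Intercept)" is always the first element. If that might not hold (for instance, if an intercept estimate were exactly zero and got filtered out), removing it by name is safer. Below is a minimal base R sketch of that idea, using hand-typed toy lists in place of the real Variables.Selected and Structural_Variables. The names `toy_selected`, `toy_structural`, `toy_no_int`, and `correctly_specified` are assumptions made for illustration only.

```r
# Toy stand-in for Variables.Selected (hand-typed, not real glmnet output)
toy_selected <- list(
  c("(Intercept)", "X1", "X6", "X7"),
  c("(Intercept)", "X3", "X5"),
  c("X2", "X4")                      # no intercept here: [-1] would drop X2
)

# Toy stand-in for Structural_Variables
toy_structural <- list(
  c("X1", "X6", "X7"),
  c("X3", "X5", "X8"),
  c("X2", "X4")
)

# Drop the intercept by name, so its position does not matter
toy_no_int <- lapply(toy_selected, setdiff, "(Intercept)")

# Score each fit: TRUE when the selected and structural sets match exactly
correctly_specified <- mapply(setequal, toy_no_int, toy_structural)
print(correctly_specified)
```

With the real objects, the same two calls should apply to Variables.Selected and Structural_Variables, provided both lists are in the same dataset order.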