Tags: r, regression, lasso-regression, reproducible-research

How can I use lapply or a for loop combined with an if statement in R to return just the names of LASSO IVs with coefficients > 0?


After a lot of legwork and several helpful answers to my previous Stack Overflow questions, I have successfully fit a separate LASSO regression to each of my 47,000 datasets for a research project. Each dataset has 500 rows and 31 columns (30 IVs and 1 DV). The fits are stored in a list called LASSO_fits, and I have also extracted just the coefficients returned by these 47k LASSOs into a list called LASSO_Coeffs. My question is: how can I extract only the names of the independent variables (factors/columns) that were 'selected' by the LASSO for each dataset i (where i indexes the 47k datasets) and assign them to a new list? By 'selected', I mean those factors whose coefficients are greater than 0.

My plan was to make sure the following code for the single case runs fine, then generalize it by combining it with either lapply or a For Loop:

if (LASSO_Coeffs[[1]][["X1"]] > 0) { 
  print(names(LASSO2_Coeffs[[1]][["X1"]]))
}

However, my plan got derailed when the above code returned the following:

> if (LASSO_Coeffs[[1]][["X1"]] > 0) { 
+  print(names(LASSO2_Coeffs[[1]][["X1"]]))
+ }
NULL
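
For reference, the NULL is what R returns whenever names() is called on a single value extracted with `[[`, because `[[` returns the bare number and drops its name. The sketch below assumes that LASSO2_Coeffs is a typo for LASSO_Coeffs and uses a small made-up named vector (`coefs`, with illustrative values) in place of one element of that list:

```r
# Toy stand-in for one element of LASSO_Coeffs (values are illustrative)
coefs <- c(X1 = 0.155, X2 = 0.077, X3 = 0)

names(coefs[["X1"]])  # NULL: `[[` returns the bare number, name dropped
names(coefs["X1"])    # "X1": single-bracket `[` keeps the name

# Names of every coefficient above zero, in one vectorised step
selected <- names(coefs)[coefs > 0]
selected
```

Because the comparison `coefs > 0` is vectorised, no if statement is needed for a single coefficient vector.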

P.S. The code that produced LASSO_fits and LASSO_Coeffs is included below in case it is relevant (the entire script, called "LASSO code.R", can be found in my GitHub repository). This is what I used to obtain all of the fitted LASSO estimates:

library(dplyr)        # select(), starts_with()
library(elasticnet)   # enet()

# This fits all 47,000 LASSO regressions, one for each of the
# 47k datasets stored in the list `datasets`, and returns the
# standard results object that enet() produces for each fit
set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(datasets, function(i) 
               enet(x = as.matrix(select(i, starts_with("X"))), 
               y = i$Y, lambda = 0, normalize = FALSE))

Then I extracted from LASSO_fits just the estimated coefficients for all 30 independent variable/factor columns of each fit and stored them as a list in the object LASSO_Coeffs:

# This stores the coefficients estimated by each LASSO fit;
# calling LASSO_Coeffs[[1]] then prints those for the first dataset
set.seed(11)     # to ensure replicability
LASSO_Coeffs <- lapply(LASSO_fits, 
                       function(i) predict(i,x = as.matrix(select(i,starts_with("X"))), 
                                           s = 0.1,mode = "fraction", 
                                           type = "coefficients")[["coefficients"]])
LASSO_Coeffs[[1]]
> LASSO_Coeffs[[1]]
        X1         X2         X3         X4         X5         X6         X7 
0.15516986 0.07733003 0.00000000 0.27838089 0.00000000 0.00000000 0.12361868 
        X8         X9        X10        X11        X12        X13        X14 
0.31700186 0.13254325 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X15        X16        X17        X18        X19        X20        X21 
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X22        X23        X24        X25        X26        X27        X28 
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X29        X30 
0.00000000 0.00000000 

The problem with the above output is that enet() returns the coefficients for all 30 factors by default, including those shrunk to zero. By contrast, when I run a stepwise regression on this same dataset using the step() function, the final output contains only the names and coefficients of the factors that were selected.


Solution

  • A very simple extension of my previous answer (the last line) gets only the coefficients above zero:

    library(dplyr)
    library(elasticnet)
    
    dfs <- lapply(list.files("sample_obs2", full.names = TRUE, recursive = TRUE), read.csv)
    
    models <- lapply(dfs, function(i) enet(x = as.matrix(select(i, starts_with("X"))), 
                       y = i$Y, lambda = 0, normalize = FALSE))
    
    coeffs <- lapply(models, function(i) predict(i, 
                            x = as.matrix(select(i, starts_with("X"))),
                            s = 0.1, mode = "fraction", type = "coefficients")[["coefficients"]])
    
    coeffs_above_zero <- lapply(coeffs, function(i) i[i > 0])
    

    Or, alternatively, to get only the names:

    coeffs_above_zero <- lapply(coeffs, function(i) names(i[i > 0]))
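
Since the question also asks about a for loop, here is a minimal self-contained sketch of both forms. It runs on a made-up list (`toy_coeffs`, a hypothetical stand-in for `coeffs`) so that it does not need the original data files. It also uses `!= 0` rather than `> 0`, on the assumption that a negative LASSO coefficient should count as selected too. If only strictly positive coefficients are wanted, swap `!= 0` back to `> 0`.

```r
# Hypothetical stand-in for `coeffs`: a list of named coefficient vectors
toy_coeffs <- list(
  c(X1 = 0.16, X2 = 0,     X3 = 0.28),
  c(X1 = 0,    X2 = -0.05, X3 = 0.12)
)

# lapply version: keep the names of all nonzero coefficients
selected_lapply <- lapply(toy_coeffs, function(b) names(b)[b != 0])

# Equivalent for loop, with the result list preallocated
selected_loop <- vector("list", length(toy_coeffs))
for (k in seq_along(toy_coeffs)) {
  b <- toy_coeffs[[k]]
  selected_loop[[k]] <- names(b)[b != 0]
}

# Both approaches give the same list of selected variable names
identical(selected_lapply, selected_loop)
```

Neither version needs an if statement: the logical comparison is vectorised over each coefficient vector, and indexing `names(b)` with it returns an empty character vector when nothing is selected.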