This is a collaborative research project working towards a second draft of a 2008 Working Paper that proposed a promising, straightforward, yet novel Optimal Variable Selection Algorithm in Supervised Statistical Learning. The variable selection algorithm being explored and evaluated in this research has been coined 'Estimated Exhaustive Regression' by its innovator, my collaborator, the noted econometrician Dr. Antony Davies.
A key characteristic of each of these 260k csv-formatted 503 x 31 data sets is that the equation which best describes, explains, and predicts the behavior of the true underlying regressors from the 30 initial candidates (known as "Structural Variables" in much of modern economic and econometric research) is known in advance of any analysis or any operations on these synthetic sample data sets whatsoever. This was done deliberately, by construction, in the script Dr. Davies wrote to create them via Monte Carlo Simulation.
Which of the 30 Candidate Variables are the true Underlying/Structural Variables for each data set is encoded very simply in the first two rows of each file: the first row is a 30-cell row of binary indicators, where a 1 indicates that the Candidate Variable is Structural/Explanatory/Predictive for that data set, and a 0 indicates that it is not.
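For illustration, here is a tiny mock-up of that indicator row (the values are invented for this example, not taken from any actual data set) and how it maps back to variable names:

```r
# hypothetical 30-cell indicator row (1 = structural, 0 = non-structural)
indicator_row <- c(1, 0, 1, rep(0, 27))
names(indicator_row) <- paste0("X", 1:30)
# recover the names of the true structural variables
true_vars <- names(indicator_row)[indicator_row == 1]
```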
At this point, I have the following code, which works as intended to load my N data sets into R and munge/wrangle them before running my LASSO regressions on them:
# these 2 lines together create a simple character list of
# all the file names in the file folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/Data/top 50"
paths_list <- list.files(path = folderpath, full.names = T, recursive = T)
# shorten the names of each of the datasets corresponding to
# each file path in paths_list
DS_names_list <- basename(paths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)
# sort both lists of file names so that they are in the proper numeric order
my_order = DS_names_list |>
# split apart the listed numbers, convert them to numeric
strsplit(split = "-", fixed = TRUE) |> unlist() |> as.numeric() |>
# get them in a data frame
matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
# get the appropriate ordering to sort the data frame
do.call(order, args = _)
DS_names_list = DS_names_list[my_order]
paths_list = paths_list[my_order]
# these lines read in the data from each of the csv files in parallel,
# using the paths stored in the list we just created
CL <- parallel::makeCluster(parallel::detectCores() - 4L)
parallel::clusterExport(CL, c('paths_list'))
system.time(datasets <- parallel::parLapply(cl = CL, X = paths_list,
fun = data.table::fread))
parallel::stopCluster(CL)  # release the worker processes
# assign standard column names to every data.table in 'datasets'
datasets <- lapply(datasets, function(dataset_i) {
colnames(dataset_i) <- c("Y", paste0("X", 1:30))
dataset_i })
# drop the first 3 non-data rows, then coerce every column to numeric
dfs <- lapply(datasets, function(i) {i[-1:-3, ]})
dfs <- lapply(dfs, \(X) { lapply(X, as.numeric) })
dfs <- lapply(dfs, function(i) { as.data.table(i) })
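A quick sanity check on the wrangled list can catch loading problems early. This is a sketch of my own devising ('check_datasets' is an invented helper name), assuming each data set should end up as 31 numeric columns, Y plus X1..X30:

```r
# returns TRUE only if every data set in df_list has exactly 31 columns
# and every one of those columns is numeric
check_datasets <- function(df_list) {
  all(vapply(df_list, function(d) {
    ncol(d) == 31L && all(vapply(d, is.numeric, logical(1)))
  }, logical(1)))
}
# e.g. stopifnot(check_datasets(dfs))
```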
Now, finally, here is how I am running my N LASSOs on my N data sets via the glmnet package's glmnet() function with alpha = 1 (its LASSO setting):
set.seed(188) # to ensure reproducibility
LASSO.fits <- lapply(X = dfs, function(I)
glmnet(x = as.matrix(select(I, starts_with("X"))),
y = I$Y, alpha = 1))
# extract the estimated coefficients (at lambda = 0.1) from each
# fitted LASSO so the selected specifications can be read off
LASSO.coefs = LASSO.fits |>
Map(f = \(model) coef(model, s = .1))
Variables.glmnet.LASSO.Selected <- LASSO.coefs |>
Map(f = \(matr) matr |> as.matrix() |>
as.data.frame() |> filter(s1 != 0) |> rownames())
# drop the "(Intercept)" entry from each selected specification
Variables.glmnet.LASSO.Selected <- lapply(Variables.glmnet.LASSO.Selected,
\(v) v[-1])
That last line of code creates an object whose contents look like this when printed out:
> head(Variables.glmnet.LASSO.Selected, n = 4)
[[1]]
[1] "X1" "X2" "X8" "X9" "X10" "X12" "X16" "X17" "X18" "X19" "X20" "X22" "X23" "X26"
[[2]]
[1] "X1" "X4" "X5" "X6" "X8" "X9" "X13" "X15" "X18" "X19" "X22" "X24" "X25" "X29"
[[3]]
[1] "X4" "X5" "X6" "X8" "X10" "X12" "X13" "X14" "X16" "X17" "X18" "X21" "X22" "X25" "X30"
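As an aside, the coefficient-to-names extraction above can be condensed into one small helper. This is only a sketch, assuming the single-column coefficient matrix that coef(fit, s = 0.1) returns; 'selected_from_coefs' is my own name, not a glmnet function:

```r
# given a coefficient matrix from coef(fit, s = 0.1), return the names
# of the nonzero coefficients, with the intercept excluded
selected_from_coefs <- function(coef_mat) {
  m <- as.matrix(coef_mat)
  rownames(m)[m[, 1] != 0 & rownames(m) != "(Intercept)"]
}
# usage: lapply(LASSO.fits, \(fit) selected_from_coefs(coef(fit, s = 0.1)))
```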
So all I need now is a way to create a parallel list which stores, for each data set, the names of the true Structural Variables (read off that first indicator row), so that it can be compared against the candidate regressors selected by glmnet's LASSO on that data set, such that:
> head(Structural_Variables, n = 4)
[[1]]
[1] "X1" "X2" "X8" "X9" "X10" "X12" "X16" "X17" "X18" "X19" "X20" "X22" "X23" "X26"
[[2]]
[1] "X1" "X4" "X5" "X6" "X8" "X9" "X13" "X15" "X18" "X19" "X22" "X24" "X25" "X29"
[[3]]
[1] "X4" "X5" "X6" "X8" "X10" "X12" "X13" "X14" "X16" "X17" "X18" "X21" "X22" "X25" "X30"
That is, the two lists would match line for line if all of the first 4 specifications (i.e., equations) selected by LASSO were correct, and one or more of these 4 lines could of course differ in a multitude of ways, since the selected specifications are not assumed to be correct!
p.s. Here are all the packages I load at the top of the script:
# load all necessary packages
library(plyr)
library(dplyr)
library(stringi)
library(stats)
library(leaps)
library(lars)
library(elasticnet)
library(glmnet)
library(data.table)
library(parallel)
Try this out to recreate that first row in your source datasets:
Structural_or_Non <- lapply(datasets, function(j) {j[1, -1]})
Then just use an lapply with the names function applied to each element of the list you just created:
Structural_Variables <- lapply(Structural_or_Non, function(i) {
names(i)[i == 1] })
Nonstructural_Variables <- lapply(Structural_or_Non, function(i) {
names(i)[i == 0] })
That should do the trick for you.
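Once you have both parallel lists, one way to score how often LASSO recovers the exact true specification is a per-data-set setequal comparison (order-insensitive). A sketch, with 'score_selection' being an invented helper name and the two list arguments assumed to be aligned by data set:

```r
# fraction of data sets where the selected variable set exactly
# matches the true structural variable set
score_selection <- function(selected_list, true_list) {
  mean(mapply(setequal, selected_list, true_list))
}
# e.g. score_selection(Variables.glmnet.LASSO.Selected, Structural_Variables)
```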