r, machine-learning, lapply, montecarlo, reproducible-research

Is there a way to store the factors selected by a Backward Elimination (BE) Stepwise Regression run on N datasets via lapply(full_model, FUN(i) {step(i[["Coeffs"]])})?


I have already written the following code, all of which works OK:

directory_path <- "~/DAEN_698/sample_obs"
file_list <- list.files(path = directory_path, full.names = TRUE, recursive = TRUE)
head(file_list, n = 2)
[1] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-1.csv"
[2] "C:/Users/Spencer/Documents/DAEN_698/sample_obs2/0-5-1-2.csv"

# Create another list with just the "n-n-n-n" part of the name of each dataset
library(stringi)  # provides stri_sub()
DS_name_list <- stri_sub(file_list, 49, 55)
head(DS_name_list, n = 3)
[1] "0-5-1-1" "0-5-1-2" "0-5-1-3"

# This command reads all the data in each of the N csv files via their names 
# stored in the 'file_list' list of characters.
csvs <- lapply(file_list, read.csv)

### Run a Backward Elimination Stepwise Regression on each of the N csvs.
# Assign the full model (meaning the one with all 30 candidate regressors
# included) as the initial model in step 1.
# This is crucial because if the initial model has fewer than the total number
# of candidate factors in the datasets for Stepwise to select from,
# then it could miss 1 or more of the true factors.
full_model <- lapply(csvs, function(i) {
  lm(formula = Y ~ ., data = i) })

# my failed attempt at figuring it out myself
set.seed(50)      # for reproducibility
BE_fits3 <- lapply(full_model, function(i) {step(object = i[["coefficients"]], 
direction = 'backward', scope = formula(full_model), trace = 0)})

When I hit run on the above 2 lines of code after setting the seed, I get the following error message in the Console:

Error in terms(object) : object 'i' not found

To elaborate on why it is essential that the initial model include every candidate regressor when running the Backward Elimination version of Stepwise Regression, consider the following example:

Suppose we start with an initial model of only 25 regressors, X1:X25 instead of X1:X30. Because backward elimination only removes terms, Stepwise Regression j could then never select any of the IVs/factors X26 through X30, even if 1 or more of them really are included in the true underlying population model that characterizes dataset j.
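
To make the point concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration only: 5 candidate regressors instead of 30, a made-up population model in which X1 and X5 are the true factors, and hypothetical object names. It is not my real data.

```r
# Hypothetical data: Y depends on X1 and X5 only; X2 to X4 are noise.
set.seed(50)
n <- 200
dat <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
names(dat) <- paste0("X", 1:5)
dat$Y <- 2 * dat$X1 + 3 * dat$X5 + rnorm(n)

# Initial model that deliberately leaves out X5, a true factor
reduced_start <- lm(Y ~ X1 + X2 + X3 + X4, data = dat)
be_reduced <- step(reduced_start, direction = "backward", trace = 0)

# Initial model with every candidate regressor
full_start <- lm(Y ~ ., data = dat)
be_full <- step(full_start, direction = "backward", trace = 0)

# Regressors that survive each backward elimination run
kept_reduced <- attr(terms(be_reduced), "term.labels")
kept_full <- attr(terms(be_full), "term.labels")

"X5" %in% kept_reduced  # FALSE: backward steps only drop terms, never add them
"X5" %in% kept_full
```

Whatever the data look like, the run that starts without X5 cannot return it, which is why I want all 30 candidates in the initial model.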


Solution

  • Instead of two lapply loops, one to fit the models and a second to run the stepwise regressions, use a single for loop that does both operations one after the other. This is an environments issue: it seems that step is not finding the data when run in the environment of the lapply function. Note also that step expects the fitted model object itself, not its coefficients vector.

    I have also changed the code that creates DS_name_list. The version below extracts the names from the full paths without depending on character positions.

    DS_name_list <- basename(file_list)
    DS_name_list <- tools::file_path_sans_ext(DS_name_list)
    head(DS_name_list, n = 2)
    

    And here is the regression code.

    csvs <- lapply(file_list, read.csv)
    names(csvs) <- DS_name_list
    
    set.seed(50)      # for reproducibility
    full_model <- vector("list", length = length(csvs))
    BE_fits3 <- vector("list", length = length(csvs))
    
    for(i in seq_along(csvs)) {
      full_model[[i]] <- lm(formula = Y ~ ., data = csvs[[i]])
      BE_fits3[[i]] <- step(object = full_model[[i]], 
                            scope = formula(full_model[[i]]),
                            direction = 'backward',
                            trace = 0)
    }
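
To store the factors each run selected, which is what the title asks, one option is to pull the term labels out of every fitted model once the loop has finished. The sketch below is self-contained and runs on simulated data. The `make_data` helper, the 5 regressors, and the names `BE_fits` and `selected` are assumptions for illustration, standing in for the real csv files and the 30 candidates.

```r
# Simulated stand-in for the list of data frames read from the csv files
# (hypothetical data: Y depends on X1 and X2 only; X3 to X5 are noise).
set.seed(50)
make_data <- function(n = 200) {
  X <- as.data.frame(matrix(rnorm(n * 5), ncol = 5))
  names(X) <- paste0("X", 1:5)
  X$Y <- 3 * X$X1 - 2 * X$X2 + rnorm(n)
  X
}
csvs <- list("0-5-1-1" = make_data(), "0-5-1-2" = make_data())

# Same for loop pattern as above: fit the full model, then run step on it
BE_fits <- vector("list", length = length(csvs))
names(BE_fits) <- names(csvs)
for (i in seq_along(csvs)) {
  full <- lm(formula = Y ~ ., data = csvs[[i]])
  BE_fits[[i]] <- step(object = full, direction = "backward", trace = 0)
}

# One character vector per dataset, holding the regressors that step kept
selected <- lapply(BE_fits, function(fit) attr(terms(fit), "term.labels"))
print(selected)
```

Because `selected` is a named list, `selected[["0-5-1-1"]]` returns the factors chosen for that dataset. `names(coef(fit))[-1]` should give the same names when all regressors are numeric, but the term labels avoid the expanded dummy names that factor regressors produce.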