Tags: r, loops, lasso-regression, reproducible-research

Is there a simple way to generalize code which successfully fits a LASSO regression to a single csv so that it fits all n csvs in a file folder?


My goal/task is to fit a LASSO regression, using the enet() function from the elasticnet package in R, to each of the 47,000 individual csv-formatted datasets located in one large file folder called "sample_obs". Each csv file's name follows the pattern #-#-#-#; for example, the first 3 of them are called "0.4-3-1-1", "0.4-3-1-2", and "0.4-3-1-3".
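
One way the file paths could be collected up front is sketched below (this is just an assumption on my part that the "sample_obs" folder sits directly under the working directory; adjust the path as needed):

csv_paths <- list.files("sample_obs", pattern = "\\.csv$", full.names = TRUE)
length(csv_paths)   # should come back as 47000 if every dataset was picked up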

Once a LASSO has been fit to every dataset and the output stored in a list with 47k elements, all I have left to do is separate out just the factors (aka predictors or independent variables) chosen/'selected' by LASSO j for dataset j and store each of those sets in another list. So, my final desired output for each list element should look like one of the following: either X#, X#, X#, X#, etc. (the number of X#s returned for any given dataset can range anywhere from 1 to 30, because each dataset has 30 candidate predictors/factors in it), or, as just one possible example, 1, 2, 5, 6, 9, 26, 29.
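
Just to illustrate those two output formats with a toy example (coefs below is a made-up named coefficient vector standing in for what enet() would return):

coefs <- c(X1 = 0.2, X2 = 0.14, X3 = 0, X4 = 0.06, X5 = 0)   # made-up values, for illustration only
names(coefs)[coefs != 0]     # "X1" "X2" "X4"  -> the X# format
unname(which(coefs != 0))    # 1 2 4           -> the numeric index format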

Towards completing this task, I decided to start by figuring out how to do all of this on a single one of the csv-formatted datasets, loaded into R by itself and assigned to its own object. To do this, I used a csv file from a much smaller dataset folder called "sample_obs2" in order to GREATLY reduce the runtime required! You can find the sample_obs2 dataset on my GitHub account in the "Estimated-Exhaustive_Regression-Project" repository.

Here is the code I wrote and the output I got for that simpler version:

setwd("~/DAEN_698/other datasets/sample_obs2")
> setwd("~/DAEN_698/other datasets/sample_obs2")
> getwd()
[1] "C:/Users/Spencer/Documents/DAEN_698/other datasets/sample_obs2"

# read the data in from the first csv file in the file folder
dataset_1 <- read.csv("0-5-1-1.csv")
head(dataset_1, n = 1)
        Y       X1       X2        X3        X4       X5        X6         X7       X8         X9
1 5.70511 1.339406 1.033558 0.4749296 0.3720555 0.928961 0.3804003 -0.4386075 0.786346 -0.6860546
         X10        X11         X12        X13      X14        X15       X16       X17       X18
1 -0.8863821 -0.9128645 -0.08443444 -0.2918255 1.527747 -0.8496993 0.9825339 0.8999604 -1.047078
         X19       X20        X21        X22      X23       X24         X25       X26      X27
1 0.07337369 -1.429877 -0.1062012 -0.6954525 1.025954 0.7472764 -0.02252112 0.0932389 1.173201
       X28       X29       X30
1 2.061864 -1.129998 0.1931626

library(elasticnet)   # provides enet()

set.seed(50)
LASSO2_fit1 <- enet(x = as.matrix(dataset_1[2:31]), 
                    y = dataset_1$Y, lambda = 0, normalize = FALSE)
 
LASSO_coeffs1 <- predict(LASSO2_fit1, 
                        x = as.matrix(dataset_1[2:31]),
                        s = 0.1, mode = "fraction", type = "coefficients")
LASSO_coeffs1[["coefficients"]]
        X1         X2         X3         X4         X5         X6         X7         X8         X9 
0.20039732 0.13671726 0.12411170 0.06292652 0.07892046 0.00000000 0.00000000 0.00000000 0.00000000 
       X10        X11        X12        X13        X14        X15        X16        X17        X18 
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X19        X20        X21        X22        X23        X24        X25        X26        X27 
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
       X28        X29        X30 
0.00000000 0.00000000 0.00000000

This is fairly close to the final format of the output I am looking for, and I am sure something like a for loop with an if() inside of it can get me the rest of the way there once I know how to repeat the above process and results for all 47k datasets! But my problem is that I have tried and failed to repeat that aforementioned process iteratively over all 47k of my datasets.
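
For the single-dataset case, I believe that "for loop with an if()" boils down to one logical-indexing line on the coefficient vector (just a sketch of the idea, using the LASSO_coeffs1 object from the output above):

selected_1 <- LASSO_coeffs1[["coefficients"]][LASSO_coeffs1[["coefficients"]] != 0]
names(selected_1)   # "X1" "X2" "X3" "X4" "X5" for this particular dataset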


Solution

  • All we have to do is lapply these same functions over the list of data frames that you have. You get just one vector of output coefficients per csv, as expected.

    library(dplyr)
    library(elasticnet)
    
    # read every csv in the folder into a list of data frames
    dfs <- lapply(list.files("sample_obs2", full.names = TRUE, recursive = TRUE), read.csv)
    
    # fit one LASSO (via enet) per data frame
    models <- lapply(dfs, function(i) enet(x = as.matrix(select(i, starts_with("X"))), 
                       y = i$Y, lambda = 0, normalize = FALSE))
    
    # extract the coefficient vector for each fit; the predictor matrix has to come
    # from the data frame, not the model object, so iterate over both lists together
    coeffs <- Map(function(m, d) predict(m, 
                            x = as.matrix(select(d, starts_with("X"))),
                            s = 0.1, mode = "fraction", type = "coefficients")[["coefficients"]],
                  models, dfs)
    
    coeffs[[1]]
    
    #         X1         X2         X3         X4         X5         X6         X7         X8         X9        X10        X11        X12        X13        X14        X15        X16        X17        X18        X19        X20 
    # 0.20039732 0.13671726 0.12411170 0.06292652 0.07892046 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 
    #        X21        X22        X23        X24        X25        X26        X27        X28        X29        X30 
    # 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
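
  • If what you ultimately want is just the names of the predictors LASSO selected for each dataset, one more lapply over coeffs finishes the job (a sketch; keep_selected is just an illustrative helper name):

    keep_selected <- function(cf) names(cf)[cf != 0]   # drop the zeroed-out coefficients
    selected <- lapply(coeffs, keep_selected)
    
    selected[[1]]
    # [1] "X1" "X2" "X3" "X4" "X5"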