I'm trying to do regression modeling in R. I have a data frame with 100000 rows and 51 columns. The first few columns are response variables and the rest are predictor variables. My goal is to run a "ranger" random forest model on each response variable, one at a time, against all predictor variables, and to generate a distinct output object for every response variable. I have other data frames with more than 100 response variables, so it is not feasible to type out the column names of every response and predictor variable each time I run the regression analysis.
This is a short version of the data frame, in which the response variable names start with "EE" and the predictor variable names end with "_h16". The data frame looks like this:
head(GC05cr_h16_dat3)
# A tibble: 7 × 25
  EE87865ln1 EE87866ln1 EE87895ln1 blood_vessel_h16 adrenal_gland_h16 bone_element_h16 bronchus_h16
       <dbl>      <dbl>      <dbl>            <dbl>             <dbl>            <dbl>        <dbl>
1    0.00391    0.00326    0.00332                0                 0                1            0
2    0.00139    0.00116    0.00132                0                 0                0            0
3    0.00360    0.00270    0.00469                1                 1                0            1
4    0.00323    0.00348    0.00339                0                 0                1            0
5    0.00323    0.00330    0.00382                0                 1                0            0
6    0.00278    0.00208    0.00214                0                 0                1            0
This is what I have tried so far:
for (i in names(GC05cr_h16_dat3)[grep("EE", names(GC05cr_h16_dat3))]){
  rfR.res <- ranger(
    as.formula(paste(colnames(GC05cr_h16_dat3[ , i])), "~ ."),
    data = GC05cr_h16_dat3, importance="impurity"
  )
  assign(paste0("rfR_Res_", names(GC05cr_h16_dat3)[i]), rfR.res, envir = .GlobalEnv)
}
ERROR:
Error in formula.character(object, env = baseenv()) :
invalid formula "EE87865ln1": not a call
I expect the resulting random forest objects to be named like this:
rfR_Res_EE87865ln1
rfR_Res_EE87866ln1
rfR_Res_EE87895ln1
................
................
reformulate is a useful function for building formulas from strings.
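As a minimal sketch of why the original call fails: the parentheses in the question close paste() right after the column name, so "~ ." is passed to as.formula() as a separate argument and the string being parsed is just "EE87865ln1". The snippet below uses the first response name and the predictor names visible in the question's printed data to build the formula both ways. No data is needed to construct a formula, and the object names f_paste and f_reform are only illustrative.

```r
# Names copied from the columns visible in the question's printed tibble.
response <- "EE87865ln1"
predictors <- c(
  "blood_vessel_h16", "adrenal_gland_h16",
  "bone_element_h16", "bronchus_h16"
)

# paste() must receive both pieces so that the full string "EE87865ln1 ~ ."
# is what as.formula() parses.
f_paste <- as.formula(paste(response, "~ ."))

# reformulate() builds the formula directly from character vectors and
# lists the predictors explicitly instead of relying on ".".
f_reform <- reformulate(termlabels = predictors, response = response)

print(f_paste)
print(f_reform)
```

Listing the predictors explicitly has a side benefit here: with "~ .", every other column in the data, including the remaining response variables, would be used as a predictor.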
Consider the mtcars data set. You can run a regression this way:
lm(reformulate(termlabels = c("disp", "wt"), response = "mpg"), data = mtcars)
#>
#> Call:
#> lm(formula = reformulate(termlabels = c("disp", "wt"), response = "mpg"),
#> data = mtcars)
#>
#> Coefficients:
#> (Intercept) disp wt
#> 34.96055 -0.01772 -3.35083
In this example, termlabels and response are character vectors that you can build however you like. For example, since your variables can be identified by position, you can subset the column names and loop over the responses to run all the regressions without creating another data frame.
I would do something like the following:
head(mtcars)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
all_variables <- names(mtcars)
response_variables <- all_variables[c(1, 3)]
predictors <- all_variables[-c(1, 3)]
all_models <- lapply(
  response_variables,
  function(x) lm(reformulate(termlabels = predictors, response = x), data = mtcars)
) |>
  setNames(response_variables)
You end up with a list containing one model per response variable:
all_models
#> $mpg
#>
#> Call:
#> lm(formula = reformulate(termlabels = predictors, response = x),
#> data = mtcars)
#>
#> Coefficients:
#> (Intercept) cyl hp drat wt qsec
#> 12.55052 0.09627 -0.01295 0.92864 -2.62694 0.66523
#> vs am gear carb
#> 0.16035 2.47882 0.74300 -0.61686
#>
#>
#> $disp
#>
#> Call:
#> lm(formula = reformulate(termlabels = predictors, response = x),
#> data = mtcars)
#>
#> Coefficients:
#> (Intercept) cyl hp drat wt qsec
#> 18.5336 15.5757 0.6398 10.6131 81.6154 -11.6838
#> vs am gear carb
#> -11.8042 -3.1052 6.5678 -31.3033
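The same pattern should carry over to ranger. The following is only a sketch under stated assumptions: toy_dat is a small made-up stand-in for GC05cr_h16_dat3 (random values, same naming scheme), and the variables are selected by name pattern ("EE" prefix, "_h16" suffix) rather than by position. The ranger package must be installed.

```r
library(ranger)

# Hypothetical stand-in for GC05cr_h16_dat3: same naming scheme, random values.
set.seed(1)
n <- 200
toy_dat <- data.frame(
  EE87865ln1 = runif(n, 0.001, 0.005),
  EE87866ln1 = runif(n, 0.001, 0.005),
  blood_vessel_h16 = rbinom(n, 1, 0.3),
  adrenal_gland_h16 = rbinom(n, 1, 0.3),
  bone_element_h16 = rbinom(n, 1, 0.3)
)

# Pick responses and predictors by name pattern instead of typing them out.
response_variables <- grep("^EE", names(toy_dat), value = TRUE)
predictors <- grep("_h16$", names(toy_dat), value = TRUE)

# Fit one random forest per response; only the "_h16" columns are predictors.
rf_models <- lapply(
  response_variables,
  function(x) {
    ranger(
      reformulate(termlabels = predictors, response = x),
      data = toy_dat,
      importance = "impurity"
    )
  }
) |>
  setNames(paste0("rfR_Res_", response_variables))

names(rf_models)

# Impurity importance of each predictor for the first response.
rf_models$rfR_Res_EE87865ln1$variable.importance
```

Keeping the models in a named list is usually easier to work with than calling assign() in a loop. If you really need separate objects in the global environment, list2env(rf_models, envir = .GlobalEnv) would create them from the list names.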
PS: A good resource with examples of something similar is chapter 25 of the book R for Data Science.
Created on 2023-08-03 with reprex v2.0.2