Tags: r, dataframe, for-loop, regression

How to fit a separate regression model in R for each of several response columns of a data frame (one at a time)?


I'm trying to do regression modelling in R. I have a data frame of 100000 rows and 51 columns. The first few columns are response variables and the rest are predictor variables. My goal is to fit a random forest model with "ranger" for each response variable in turn, using all predictor variables, and to generate a distinct output object for every response variable. I have other data frames with more than 100 response variables, so it is not feasible to type out the column names of every response and predictor variable each time I run the analysis.

This is a short version of the data frame, in which the response variable names start with "EE" and the predictor variable names end with "_h16". It looks like this:

head(GC05cr_h16_dat3)
# A tibble: 7 × 25
  EE87865ln1 EE87866ln1 EE87895ln1 blood_vessel_h16 adrenal_gland_h16 bone_element_h16 bronchus_h16
       <dbl>      <dbl>      <dbl>            <dbl>             <dbl>            <dbl>        <dbl>
1    0.00391    0.00326    0.00332                0                 0                1            0
2    0.00139    0.00116    0.00132                0                 0                0            0
3    0.00360    0.00270    0.00469                1                 1                0            1
4    0.00323    0.00348    0.00339                0                 0                1            0
5    0.00323    0.00330    0.00382                0                 1                0            0
6    0.00278    0.00208    0.00214                0                 0                1            0

What I have tried so far:

for (i in names(GC05cr_h16_dat3)[grep("EE", names(GC05cr_h16_dat3))]){
    rfR.res <- ranger(
      as.formula(paste(colnames(GC05cr_h16_dat3[ , i])), "~ ."),
                                data = GC05cr_h16_dat3, importance="impurity"
      ) 
    assign(paste0("rfR_Res_", names(GC05cr_h16_dat3)[i]), rfR.res, envir = .GlobalEnv)
  }
 

ERROR:

Error in formula.character(object, env = baseenv()) : 
  invalid formula "EE87865ln1": not a call

The names of my expected output rf objects would look like this:

rfR_Res_EE87865ln1
rfR_Res_EE87866ln1
rfR_Res_EE87895ln1
................
................

Solution

  • reformulate is a useful function for building formulas from strings.

    Consider the mtcars data set. You can run a regression this way:

    lm(reformulate(termlabels = c("disp", "wt"), response = "mpg"), data = mtcars)
    #> 
    #> Call:
    #> lm(formula = reformulate(termlabels = c("disp", "wt"), response = "mpg"), 
    #>     data = mtcars)
    #> 
    #> Coefficients:
    #> (Intercept)         disp           wt  
    #>    34.96055     -0.01772     -3.35083
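
To see what reformulate builds, it can help to print the formula on its own. The sketch below also shows the equivalent paste() plus as.formula() construction. In the loop from the question, the call to paste() is closed before "~ .", so as.formula() appears to receive only the column name, which matches the "not a call" error. Separately, names(GC05cr_h16_dat3)[i] with a character i returns NA, so paste0("rfR_Res_", i) would be the way to get the intended object name.

```r
# reformulate() assembles a formula from character vectors
f1 <- reformulate(termlabels = c("disp", "wt"), response = "mpg")
print(f1)
#> mpg ~ disp + wt

# Equivalent construction with paste(): the whole formula must be
# pasted into one string *before* it is handed to as.formula()
f2 <- as.formula(paste("mpg", "~", paste(c("disp", "wt"), collapse = " + ")))
print(f2)
```

Both objects are ordinary formulas, so either can be passed to lm() or to any other function that takes a formula.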
    

    In this example, termlabels and response are character vectors that you can generate any way you like. For example, you can select the variables by position (or by a pattern in their names) and loop over the responses to run all regressions without creating another data frame.
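
For the data frame in the question, the two vectors could be built from the column names rather than from positions. A minimal sketch, using a small made-up data frame: toy_dat and its values are hypothetical, and only the "EE" prefix and the "_h16" suffix come from the question.

```r
# Hypothetical stand-in for GC05cr_h16_dat3 (values are made up)
toy_dat <- data.frame(
  EE87865ln1        = c(0.0039, 0.0014, 0.0036, 0.0032),
  EE87866ln1        = c(0.0033, 0.0012, 0.0027, 0.0035),
  blood_vessel_h16  = c(0, 0, 1, 0),
  adrenal_gland_h16 = c(0, 0, 1, 0)
)

# Responses: names starting with "EE"; predictors: names ending in "_h16"
response_variables <- grep("^EE", names(toy_dat), value = TRUE)
predictors         <- grep("_h16$", names(toy_dat), value = TRUE)

response_variables
predictors
```

Anchoring the patterns with ^ and $ avoids picking up a column that merely contains "EE" or "_h16" somewhere in the middle of its name.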

    I would do something like the following:

    head(mtcars)
    #>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
    #> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    #> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    #> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    #> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    #> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    #> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    
    1. Pick your variables
    all_variables <- names(mtcars)
    response_variables <- all_variables[c(1, 3)]
    predictors <- all_variables[-c(1, 3)]
    
    2. Run the models
    all_models <- lapply(
      response_variables,
      function(x) lm(reformulate(termlabels = predictors, response = x), data = mtcars)
    ) |>
      setNames(response_variables)
    

    You end up with a list containing one model per response variable:

    all_models
    #> $mpg
    #> 
    #> Call:
    #> lm(formula = reformulate(termlabels = predictors, response = x), 
    #>     data = mtcars)
    #> 
    #> Coefficients:
    #> (Intercept)          cyl           hp         drat           wt         qsec  
    #>    12.55052      0.09627     -0.01295      0.92864     -2.62694      0.66523  
    #>          vs           am         gear         carb  
    #>     0.16035      2.47882      0.74300     -0.61686  
    #> 
    #> 
    #> $disp
    #> 
    #> Call:
    #> lm(formula = reformulate(termlabels = predictors, response = x), 
    #>     data = mtcars)
    #> 
    #> Coefficients:
    #> (Intercept)          cyl           hp         drat           wt         qsec  
    #>     18.5336      15.5757       0.6398      10.6131      81.6154     -11.6838  
    #>          vs           am         gear         carb  
    #>    -11.8042      -3.1052       6.5678     -31.3033
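
Keeping the models in a named list usually replaces the assign() approach from the question: each fit can be reached by name, and the whole set can be summarised in one call. The sketch below rebuilds all_models so that it runs on its own. The rfR_Res_ prefix is taken from the question, and the list2env() step is optional, needed only if separate global objects are really wanted.

```r
all_variables <- names(mtcars)
response_variables <- all_variables[c(1, 3)]
predictors <- all_variables[-c(1, 3)]

all_models <- lapply(
  response_variables,
  function(x) lm(reformulate(termlabels = predictors, response = x), data = mtcars)
)
names(all_models) <- response_variables

# Access one model by the name of its response variable
mpg_model <- all_models[["mpg"]]

# Summarise every model at once, e.g. R-squared per response
r_squared <- sapply(all_models, function(m) summary(m)$r.squared)
r_squared

# Optional: create one global object per model, named rfR_Res_<response>
prefixed <- setNames(all_models, paste0("rfR_Res_", names(all_models)))
invisible(list2env(prefixed, envir = .GlobalEnv))
exists("rfR_Res_mpg")
```

With more than 100 response variables, the list form tends to be easier to work with than 100 separate objects, because sapply() or lapply() can walk over all fits at once.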
    

    PS: A good resource with examples of something similar is chapter 25 of the book R for Data Science.

    Created on 2023-08-03 with reprex v2.0.2
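
Applied to the question, the same pattern might look like the sketch below. It assumes the ranger package is installed. toy_dat is a made-up stand-in for GC05cr_h16_dat3, and num.trees and seed are set only to keep the example small and repeatable. Listing the predictors explicitly, rather than using "~ .", keeps the other EE columns from being used as predictors.

```r
library(ranger)

# Hypothetical stand-in for GC05cr_h16_dat3 (random values)
set.seed(1)
n <- 60
toy_dat <- data.frame(
  EE87865ln1        = runif(n, 0.001, 0.005),
  EE87866ln1        = runif(n, 0.001, 0.005),
  blood_vessel_h16  = rbinom(n, 1, 0.3),
  adrenal_gland_h16 = rbinom(n, 1, 0.3),
  bronchus_h16      = rbinom(n, 1, 0.3)
)

response_variables <- grep("^EE", names(toy_dat), value = TRUE)
predictors         <- grep("_h16$", names(toy_dat), value = TRUE)

# One random forest per response, stored in a named list
rf_models <- lapply(response_variables, function(x) {
  ranger(
    reformulate(termlabels = predictors, response = x),
    data = toy_dat,
    importance = "impurity",
    num.trees = 50,
    seed = 1
  )
})
names(rf_models) <- paste0("rfR_Res_", response_variables)

names(rf_models)
rf_models[["rfR_Res_EE87865ln1"]]$variable.importance
```

Each element of rf_models is a regular ranger object, so fields such as variable.importance can be pulled from all fits with lapply().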