Search code examples
rtidyverselinear-regressionpurrr

How to Run Several Multivariate Regression Models on Nested Dataframes?


EDIT: I've resorted to using the mtcars dataset instead, as it better represents the structure of my real data. Disregard the iris example below:

I'm looking to run several regression models with multiple dependent and independent variables.

Say I'm working on the iris dataset.

My dependent variables would be Petal.Length and Petal.Width. My independent variables would be Sepal.Length and Sepal.Width.

I would like to run the following models for each species in the dataset separately (setosa, versicolor, virginica):

lm(Petal.Length ~ Sepal.Length)
lm(Petal.Length ~ Sepal.Width)
lm(Petal.Width ~ Sepal.Length)
lm(Petal.Width ~ Sepal.Width)

My real dataset has many more dependent and independent variables, which is why I'd like to write code that is less repetitious. Here is my attempt so far, where I've nested the Species column with Tidyr, and mutated a new column containing the regression data for each species:

library(dplyr)
library(tidyr)
library(purrr)

iris %>%
  nest(data = !Species) %>%
  mutate(
    fit = map(data, ~ lm (cbind(Petal.Length, Petal.Width) ~ ., data = .x)) 
  )

Which gives the equivalent of this for each species:

lm(Petal.Length ~ Sepal.Length + Sepal.Width)
lm(Petal.Width ~ Sepal.Length + Sepal.Width)

Let's use the mtcars dataset instead:

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
...

My dependent variables would be disp and hp. My independent variables would be drat, wt, and qsec.

I would like to run the following models for each vs and am type in the dataset separately:

lm(disp ~ drat)
lm(disp ~ wt)
lm(disp ~ qsec)
lm(hp ~ drat)
lm(hp ~ wt)
lm(hp ~ qsec)

My real dataset has many more dependent and independent variables, which is why I'd like to write code that is less repetitious. Here is my attempt so far, where I've nested the vs and am columns with Tidyr, and mutated a new column containing the regression data for each vs and am :

library(dplyr)
library(tidyr)
library(purrr)

mtcars %>%
  group_by(vs,am) %>%
  nest() %>%
  mutate(
    fit = map(data, ~ lm (cbind(disp, hp) ~ drat + wt + qsec, data = .x)) 
  )

# A tibble: 4 x 4
# Groups:   vs, am [4]
     vs    am data              fit   
  <dbl> <dbl> <list>            <list>
1     0     1 <tibble [6 x 9]>  <mlm> 
2     1     1 <tibble [7 x 9]>  <mlm> 
3     1     0 <tibble [7 x 9]>  <mlm> 
4     0     0 <tibble [12 x 9]> <mlm> 

Which gives the equivalent of the following formulas for each combination of vs and am:

lm(disp ~ drat + wt + qsec)
lm(hp ~ drat + wt + qsec)

But this isn't what I want. The above code gives two multiple regressions per vs and am combination, when I want six simple regressions. I have no idea how to produce this in R.


Solution

  • This is modelling, and using the correct formula after reshaping the data, you can get the desired results without the loops.

    new_data <- pivot_longer(iris, c(Sepal.Length, Sepal.Width), 
                                    names_to = 'Sepal', names_prefix = 'Sepal')
    
    lm(cbind(Petal.Width, Petal.Length) ~ value/(Species:Sepal) - value + 
                        Sepal/Species - Sepal + 0, new_data)
    
    Call:
    lm(formula = cbind(Petal.Width, Petal.Length) ~ value/(Species:Sepal) - 
                    value + Sepal/Species - Sepal + 0, data = new_data)
    
    Coefficients:
                                          Petal.Width  Petal.Length
    Speciessetosa:Sepal.Length            -0.17022      0.80305    
    Speciesversicolor:Sepal.Length         0.08326      0.18512    
    Speciesvirginica:Sepal.Length          1.22611      0.61047    
    Speciessetosa:Sepal.Width              0.02418      1.18292    
    Speciesversicolor:Sepal.Width          0.16691      1.93492    
    Speciesvirginica:Sepal.Width           0.66406      3.51090    
    value:Speciessetosa:Sepal.Length       0.08314      0.13163    
    value:Speciesversicolor:Sepal.Length   0.20936      0.68647    
    value:Speciesvirginica:Sepal.Length    0.12142      0.75008    
    value:Speciessetosa:Sepal.Width        0.06471      0.08141    
    value:Speciesversicolor:Sepal.Width    0.41845      0.83938    
    value:Speciesvirginica:Sepal.Width     0.45795      0.68632    
    

    If you want to run the for loops:

    iris %>%
      nest(data = !Species) %>%
      summarise(model = map(c('Sepal.Length', 'Sepal.Width'),
                                   ~map(data,~lm(reformulate(.y, 'cbind(Petal.Width, Petal.Length)'),.x) %>%
                                                   coef()%>%
                                                   as_tibble(rownames = 'rn'), .y = .x)%>%
                              set_names(Species)%>%
                              bind_rows(.id='Species')))%>%
      unnest(model) 
    
    # A tibble: 12 x 4
       Species    rn           Petal.Width Petal.Length
       <chr>      <chr>              <dbl>        <dbl>
     1 setosa     (Intercept)      -0.170        0.803 
     2 setosa     Sepal.Length      0.0831       0.132 
     3 versicolor (Intercept)       0.0833       0.185 
     4 versicolor Sepal.Length      0.209        0.686 
     5 virginica  (Intercept)       1.23         0.610 
     6 virginica  Sepal.Length      0.121        0.750 
     7 setosa     (Intercept)       0.0242       1.18  
     8 setosa     Sepal.Width       0.0647       0.0814
     9 versicolor (Intercept)       0.167        1.93  
    10 versicolor Sepal.Width       0.418        0.839 
    11 virginica  (Intercept)       0.664        3.51  
    12 virginica  Sepal.Width       0.458        0.686 
    

    EDIT: in the case of mtcars dataset, the following will suffice:

    pivot_longer(mtcars, c(drat, wt, qsec)) %>%
      unite('am_vs', c(am, vs)) %>%
      lm(cbind(disp, hp)~value/(am_vs:name)-value + name/am_vs - name + 0, .)
    

    Running using the map:

    mtcars%>%
      nest_by(vs, am)%>%
      rowwise()%>%
      summarise(vs, am, model = map(c('drat', 'wt', 'qsec'),
                     ~lm(sprintf('cbind(disp, hp)~%s',.x), data)%>%
                       coef()%>%
                       as_tibble(rownames = 'rn')))%>%
      unnest(model)