Search code examples
rdplyrregressiontidyrbroom

"Multi-step" regression with broom and dplyr in R


I am looking for a way to perform "multi-step" regression with broom and dplyr in R. I use "multi-step" as a placeholder for regression analyses in which you integrate in the final regression model elements of previous regression models, such as the fit or the residuals. An example for such a "multi-step" regression would be the 2SLS approach for Instrumental Variable (IV) regression.

My (grouped) data looks like this:

df <- data.frame(
  id = sort(rep(seq(1, 20, 1), 5)),
  group = rep(seq(1, 4, 1), 25),
  y = runif(100),
  x = runif(100),
  z1 = runif(100),
  z2 = runif(100)
)

where id and group are identifiers, y the dependent variable, while x, z1 and z2 are predictors. In a IV setting x would be an endogenous predictor.

Here is an example for a "multi-step" regression:

library(tidyverse)
library(broom)

# Nest the data frame
df_nested <- df %>% 
  group_by(group) %>% 
  nest()

# Run first stage regression and retrieve residuals
df_fit <- df_nested %>% 
  mutate(
    fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
    resids = map(fit1, residuals) 
  )

# Run second stage with residuals as control variable
df_fit %>% 
  mutate(
    fit2 = map2(data, resids, ~ tidy(lm(y ~ x + z2 + .y["resids"], data = .x)))
        ) %>% 
  unnest(fit2)

This produces an error, which indicates that .x and .y have different lengths. What is a solution to integrate the residuals, in this attempt the .y["resids"], into the second regression as a control variable?


Solution

  • One option to achieve your desired result would be to add the residuals as a new column to your dataframe after the first stage regression:

    library(tidyverse)
    library(broom)
    
    # Nest the data frame
    df_nested <- df %>% 
      group_by(group) %>% 
      nest()
    
    # Run first stage regression and retrieve residuals
    df_fit <- df_nested %>% 
      mutate(
        fit1 = map(data, ~ lm(x ~ z1 + z2, data = .x)),
        resids = map(fit1, residuals),
        data = map2(data, resids, ~ bind_cols(.x, resids = .y))
      )
    
    # Run second stage with residuals as control variable
    df_fit %>% 
      mutate(
        fit2 = map(data, ~ tidy(lm(y ~ x + z2 + resids, data = .x)))
      ) %>% 
      unnest(fit2)
    #> # A tibble: 16 × 9
    #> # Groups:   group [4]
    #>    group data        fit1   resids  term    estimate std.error statistic p.value
    #>    <dbl> <list>      <list> <list>  <chr>      <dbl>     <dbl>     <dbl>   <dbl>
    #>  1     1 <tibble [2… <lm>   <dbl [… (Inter…   0.402      0.524    0.767  0.451  
    #>  2     1 <tibble [2… <lm>   <dbl [… x         0.0836     0.912    0.0917 0.928  
    #>  3     1 <tibble [2… <lm>   <dbl [… z2        0.161      0.250    0.644  0.527  
    #>  4     1 <tibble [2… <lm>   <dbl [… resids   -0.0536     0.942   -0.0569 0.955  
    #>  5     2 <tibble [2… <lm>   <dbl [… (Inter…   0.977      0.273    3.58   0.00175
    #>  6     2 <tibble [2… <lm>   <dbl [… x        -0.561      0.459   -1.22   0.235  
    #>  7     2 <tibble [2… <lm>   <dbl [… z2       -0.351      0.192   -1.82   0.0826 
    #>  8     2 <tibble [2… <lm>   <dbl [… resids    0.721      0.507    1.42   0.170  
    #>  9     3 <tibble [2… <lm>   <dbl [… (Inter…  -0.710      1.19    -0.598  0.556  
    #> 10     3 <tibble [2… <lm>   <dbl [… x         3.61       3.80     0.951  0.352  
    #> 11     3 <tibble [2… <lm>   <dbl [… z2       -1.21       1.19    -1.01   0.323  
    #> 12     3 <tibble [2… <lm>   <dbl [… resids   -3.67       3.80    -0.964  0.346  
    #> 13     4 <tibble [2… <lm>   <dbl [… (Inter…  59.6       40.1      1.49   0.152  
    #> 14     4 <tibble [2… <lm>   <dbl [… x       -83.4       56.5     -1.48   0.155  
    #> 15     4 <tibble [2… <lm>   <dbl [… z2      -18.7       12.8     -1.45   0.160  
    #> 16     4 <tibble [2… <lm>   <dbl [… resids   83.4       56.5      1.48   0.155