Search code examples
rdataframedplyrlinear-regressionpurrr

Use map() and map2() to run regressions and add fitted values to data frame


I want to regress each of a subset of columns of a data frame on another column of the same data frame (and on top on a subset of observations), and then append the fitted values as new and named columns to the original data frame. As an example, I'll use the EuStockMarkets data that comes with R, transformed into a data frame df.

df <- zoo::fortify.zoo(EuStockMarkets)

I would like to regress the columns DAX, SMI, CAC, and FTSE on Index

> colnames(df)
[1] "Index" "DAX"   "SMI"   "CAC"   "FTSE" 

for values of Index up to Index == 1996.877 using {purrr} functions to avoid a loop. Then, add the fitted values as new columns to df with names DAX_fitted, SMI_fitted, CAC_fitted, and FTSE_fitted.

I came up with two options until now:

fitted <- df %>% 
  select(-Index) %>% 
  names() %>% 
  paste(.,' ~ Index') %>%  
  map(as.formula) %>% 
  map2(., .y = rep(list(df %>% filter(Index < 1996.877)), length(.)), ~ predict(lm(.x, data = .y))) 

which gives me a list of the fitted values or

lm <- df %>% 
    select(-Index) %>% 
    names() %>% 
    paste(.,' ~ Index') %>%  
    map(as.formula) %>% 
    map(lm, data = df %>% filter(Index < 1996.877))

which returns a list lm with regression results.

Ideas on how to complete these code lines to add the named fitted values to the original data frame? Thanks!


Solution

  • With dplyr, you can use mutate() + across() to achieve that:

    library(dplyr)
    
    df %>%
      filter(Index < 1996.877) %>%
      mutate(across(-Index, ~ lm(.x ~ Index)$fitted.values,
                    .names = "{.col}_fitted"))
    
    # # A tibble: 1,400 × 9
    #    Index   DAX   SMI   CAC  FTSE DAX_fitted SMI_fitted CAC_fitted FTSE_fitted
    #    <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>      <dbl>      <dbl>       <dbl>
    #  1 1991. 1629. 1678. 1773. 2444.      1465.      1535.      1865.       2363.
    #  2 1992. 1614. 1688. 1750. 2460.      1465.      1536.      1865.       2364.
    #  3 1992. 1607. 1679. 1718  2448.      1466.      1538.      1865.       2365.
    #  4 1992. 1621. 1684. 1708. 2470.      1467.      1539.      1865.       2366.
    #  5 1992. 1618. 1687. 1723. 2485.      1468.      1541.      1865.       2367.
    #  6 1992. 1611. 1672. 1714. 2467.      1468.      1542.      1865.       2368.
    #  7 1992. 1631. 1683. 1734. 2488.      1469.      1544.      1865.       2369.
    #  8 1992. 1640. 1704. 1757. 2508.      1470.      1545.      1866.       2370.
    #  9 1992. 1635. 1698. 1754  2510.      1471.      1547.      1866.       2371.
    # 10 1992. 1646. 1716. 1754. 2497.      1471.      1548.      1866.       2372.
    # # ℹ 1,390 more rows
    # # ℹ Use `print(n = ...)` to see more rows
    

    To minimally modify the code you have tried, you just need set_names() and map_dfc():

    library(purrr)
    
    df %>%
      select(-Index) %>% 
      names() %>%
      set_names(paste0, "_fitted") %>%
      map_dfc(~ lm(as.formula(paste(.x, "~ Index")), filter(df, Index < 1996.877))$fitted.values)
    
    # # A tibble: 1,400 × 4
    #    DAX_fitted SMI_fitted CAC_fitted FTSE_fitted
    #         <dbl>      <dbl>      <dbl>       <dbl>
    #  1      1465.      1535.      1865.       2363.
    #  2      1465.      1536.      1865.       2364.
    #  3      1466.      1538.      1865.       2365.
    #  4      1467.      1539.      1865.       2366.
    #  5      1468.      1541.      1865.       2367.
    #  6      1468.      1542.      1865.       2368.
    #  7      1469.      1544.      1865.       2369.
    #  8      1470.      1545.      1866.       2370.
    #  9      1471.      1547.      1866.       2371.
    # 10      1471.      1548.      1866.       2372.
    # # ℹ 1,390 more rows
    # # ℹ Use `print(n = ...)` to see more rows