r dplyr purrr

Purrr with group_by


Using purrr, I would like to modify the following example from Advanced R to calculate the mean of each variable in the mtcars data, split by cyl:

    by_cyl <- split(mtcars, mtcars$cyl)
    by_cyl %>%
      map(~ lm(mpg ~ wt, data = .x)) %>%
      map(coef) %>%
      map_dbl(2)

I can do this for a specific value of cyl:

    mtcars %>%
      filter(cyl == 8) %>%
      map_df(mean)

But this does not work:

    by_cyl %>%
      map_df(~ mean(.x, na.rm = TRUE))

I guess it's because I'm passing a whole data frame to mean() instead of a vector, but I don't know how to fix this.


Solution

  • Another option is to do nested calls to map()/map_df():

    library("purrr")
    library("magrittr")
    #> 
    #> Attaching package: 'magrittr'
    #> The following object is masked from 'package:purrr':
    #> 
    #>     set_names
    by_cyl <- split(mtcars, mtcars$cyl)
    by_cyl %>% map_df(map_df, mean, na.rm = TRUE)
    #> # A tibble: 3 x 11
    #>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
    #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1  26.7     4  105.  82.6  4.07  2.29  19.1 0.909 0.727  4.09  1.55
    #> 2  19.7     6  183. 122.   3.59  3.12  18.0 0.571 0.429  3.86  3.43
    #> 3  15.1     8  353. 209.   3.23  4.00  16.8 0     0.143  3.29  3.5
    

    Created on 2023-06-28 with reprex v2.0.2

    Basically, a list of data frames (what you get from split()) is a list of lists, because each data frame is itself a list of columns. So you have to map across the list returned by split() and then across the columns of each data frame in that list.
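
    As a small illustration of that point, the sketch below checks that one element of by_cyl really is a list, shows what mean() does when handed a whole data frame, and runs the inner step on its own. The names one_group and col_means are made up for this example, and map_dbl() is used for the inner step only to get a plain named vector.

    ```r
    library(purrr)

    by_cyl <- split(mtcars, mtcars$cyl)

    # Hypothetical name: the 4-cylinder subset, just one element of the split
    one_group <- by_cyl[["4"]]

    # A data frame is stored as a list of column vectors
    is.list(one_group)   # TRUE
    typeof(one_group)    # "list"

    # mean() has no method for a whole data frame: it warns and returns NA
    suppressWarnings(mean(one_group, na.rm = TRUE))

    # Mapping over the columns of a single data frame is the "inner" step
    col_means <- map_dbl(one_group, mean, na.rm = TRUE)
    col_means
    ```

    The outer map_df() in the answer simply repeats this inner step for every element of by_cyl.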

    EDIT: The inner part of the first map_df() call is a simpler way of writing ~(.x %>% map_df(~(mean(.x, na.rm = TRUE)))). It uses the ... ("dots") argument of map_df() to pass arguments on to the inner map_df().
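
    One way to convince yourself that the two spellings do the same thing is to run both and compare the results. This is only a sketch; short_form and long_form are names invented for the comparison.

    ```r
    library(purrr)
    library(magrittr)

    by_cyl <- split(mtcars, mtcars$cyl)

    # Short form: map_df, mean and na.rm = TRUE are passed through the dots
    short_form <- by_cyl %>% map_df(map_df, mean, na.rm = TRUE)

    # Long form: the same nesting written out with explicit formula lambdas
    long_form <- by_cyl %>%
      map_df(~ .x %>% map_df(~ mean(.x, na.rm = TRUE)))

    # Compare the two results
    all.equal(short_form, long_form)
    ```

    The short form is less to type, while the long form makes it more obvious which call receives na.rm = TRUE.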

    But this approach is much slower than the colMeans() approach in Quinten's answer:

    library("purrr")
    library("magrittr")
    #> 
    #> Attaching package: 'magrittr'
    #> The following object is masked from 'package:purrr':
    #> 
    #>     set_names
    by_cyl <- split(mtcars, mtcars$cyl)
    bench::mark(by_cyl %>% map_df(map_df, mean, na.rm = TRUE),
                by_cyl %>% map_df(colMeans))
    #> # A tibble: 2 x 6
    #>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
    #> 1 by_cyl %>% map_df(map_df, mean, ~ 14.39ms  19.7ms      47.7    3.87MB     6.81
    #> 2 by_cyl %>% map_df(colMeans)        3.68ms  4.57ms     192.   153.88KB     8.82