Using purrr, I would like to modify the following example from Advanced R to calculate the mean of each variable in the mtcars data, split by cyl:
by_cyl <- split(mtcars, mtcars$cyl)
by_cyl %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map(coef) %>%
map_dbl(2)
I can do this for a specific value of cyl:
mtcars %>%
filter(cyl == 8) %>%
map_df(mean)
But this does not work:
by_cyl %>%
map_df(~mean(.x, na.rm = TRUE))
I guess it's because I'm passing mean over a whole data frame, instead of a vector, but I don't know how to fix this.
Another option is to do nested calls to map()/map_df():
library("purrr")
library("magrittr")
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
by_cyl <- split(mtcars, mtcars$cyl)
by_cyl %>% map_df(map_df, mean, na.rm = TRUE)
#> # A tibble: 3 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 26.7 4 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
#> 2 19.7 6 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
#> 3 15.1 8 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
Created on 2023-06-28 with reprex v2.0.2
Basically, a list of data frames (what you obtain after using split()) is a list of lists. So you have to map across the list you get from split() and then across the data frames in that list.
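To make the list-of-lists point concrete, here is a minimal sketch (not from the original answer, and assuming only purrr and the built-in mtcars data). It checks that each element of the split is itself a list of columns, shows that mean() on a whole data frame does not give per-column means, and then maps over the columns of a single group instead:

```r
library(purrr)

by_cyl <- split(mtcars, mtcars$cyl)

# The split result is a list, and each element (a data frame) is a list of columns
outer_is_list <- is.list(by_cyl)
inner_is_list <- is.list(by_cyl[["4"]])

# mean() on a whole data frame warns and returns NA rather than per-column means
whole_df_mean <- suppressWarnings(mean(by_cyl[["4"]]))

# Mapping over the columns of one group gives one mean per variable
col_means_4 <- map_dbl(by_cyl[["4"]], mean, na.rm = TRUE)
col_means_4
```

The nested map_df() call does this inner step once for every group and binds the results into one row per group.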
EDIT: The syntax for the inner part of the first map_df() is a simpler way of specifying ~(.x %>% map_df(~(mean(.x, na.rm = TRUE)))) that uses the ... or "dots" argument of map_df() to pass in the arguments for the inner map_df().
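As a quick check of that equivalence, the sketch below writes the same computation both ways and compares the results. The names short_form and long_form are only illustrative, and this assumes dplyr is installed, since map_df() relies on it for row binding:

```r
library(purrr)

by_cyl <- split(mtcars, mtcars$cyl)

# Shorthand: everything after the first function is forwarded through `...`,
# so the outer map_df() calls map_df(<group>, mean, na.rm = TRUE) for each group
short_form <- map_df(by_cyl, map_df, mean, na.rm = TRUE)

# Explicit: the same computation spelled out with lambda formulas
long_form <- map_df(by_cyl, ~ map_df(.x, ~ mean(.x, na.rm = TRUE)))

# Compare the two results
all.equal(short_form, long_form)
```

The shorthand avoids the nested `.x` pronouns, which can be hard to read because the inner `.x` (a column) shadows the outer one (a data frame).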
But this approach is much slower than the colMeans approach from Quinten's answer (roughly four times slower by median time in this benchmark):
library("purrr")
library("magrittr")
#>
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#>
#> set_names
by_cyl <- split(mtcars, mtcars$cyl)
bench::mark(by_cyl %>% map_df(map_df, mean, na.rm = TRUE),
by_cyl %>% map_df(colMeans))
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 by_cyl %>% map_df(map_df, mean, ~ 14.39ms 19.7ms 47.7 3.87MB 6.81
#> 2 by_cyl %>% map_df(colMeans) 3.68ms 4.57ms 192. 153.88KB 8.82