I'm trying to use dplyr::summarize() and dplyr::across() to obtain a tibble with several summary statistics in the rows and the variables in the columns. I was only able to achieve this result by using dplyr::bind_rows(), but I'm wondering if there's a more elegant way to get the same output.
> library(tidyverse)
── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.3 ✔ purrr 0.3.4
✔ tibble 3.1.1 ✔ dplyr 1.0.6
✔ tidyr 1.1.3 ✔ stringr 1.4.0
✔ readr 1.4.0 ✔ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
>
> bind_rows(min = summarize(starwars, across(where(is.numeric), min,
+ na.rm = TRUE)),
+ median = summarize(starwars, across(where(is.numeric), median,
+ na.rm = TRUE)),
+ mean = summarize(starwars, across(where(is.numeric), mean, na.rm = TRUE)),
+ max = summarize(starwars, across(where(is.numeric), max, na.rm = TRUE)),
+ sd = summarize(starwars, across(where(is.numeric), sd, na.rm = TRUE)),
+ .id = "statistic")
# A tibble: 5 x 4
statistic height mass birth_year
<chr> <dbl> <dbl> <dbl>
1 min 66 15 8
2 median 180 79 52
3 mean 174. 97.3 87.6
4 max 264 1358 896
5 sd 34.8 169. 155.
Why can't one do this with summarize() directly? That seems more elegant than using a list of functions, as suggested by the colwise vignette. Does it violate the principles of a tidy data frame? (It seems to me that stacking a bunch of data frames beside one another is far less tidy.)
Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but in less code.
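library(dplyr)
library(purrr)

# lst() names each element after its expression, so the names can feed .id
funs <- lst(min, median, mean, max, sd)

# One summarize() call per function; map_dfr() row-binds the one-row results
map_dfr(funs,
        ~ summarize(starwars, across(where(is.numeric), .x, na.rm = TRUE)),
        .id = "statistic")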
# # A tibble: 5 x 4
# statistic height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 min 66 15 8
# 2 median 180 79 52
# 3 mean 174. 97.3 87.6
# 4 max 264 1358 896
# 5 sd 34.8 169. 155.
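To make the equivalence with bind_rows() explicit, here is a minimal sketch of the same idea written with lapply() and an anonymous function. It is only one possible spelling, not the canonical one. It may be useful on newer releases, because dplyr 1.1.0 deprecated passing extra arguments such as na.rm through across(), and purrr 1.0.0 marked map_dfr() as superseded. The names summaries and result are illustrative.

```r
library(dplyr)

# Named list of summary functions; the names become the "statistic" column.
funs <- list(min = min, median = median, mean = mean, max = max, sd = sd)

# Build one one-row summary per function. na.rm = TRUE is supplied inside
# an anonymous function rather than through across()'s `...`.
summaries <- lapply(funs, function(f) {
  summarize(starwars, across(where(is.numeric), function(x) f(x, na.rm = TRUE)))
})

# Stack the one-row tibbles; .id records which list element each row came from.
result <- bind_rows(summaries, .id = "statistic")
result
```

This should give the same 5 x 4 layout as above: one row per statistic and one column per numeric variable.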