
Summary statistics for multiple variables with statistics as rows and variables as columns?


I'm trying to use dplyr::summarize() and dplyr::across() to obtain a tibble with several summary statistics in the rows and the variables in the columns. I was only able to achieve this result by using dplyr::bind_rows(), but I'm wondering if there's a more elegant way to get the same output.

> library(tidyverse)
── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.1     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
> 
> bind_rows(min = summarize(starwars, across(where(is.numeric), min, 
+       na.rm = TRUE)), 
+   median = summarize(starwars, across(where(is.numeric), median, 
+       na.rm = TRUE)), 
+   mean = summarize(starwars, across(where(is.numeric), mean, na.rm = TRUE)), 
+   max = summarize(starwars, across(where(is.numeric), max, na.rm = TRUE)), 
+   sd = summarize(starwars, across(where(is.numeric), sd, na.rm = TRUE)), 
+   .id = "statistic")
# A tibble: 5 x 4
  statistic height   mass birth_year
  <chr>      <dbl>  <dbl>      <dbl>
1 min         66     15          8  
2 median     180     79         52  
3 mean       174.    97.3       87.6
4 max        264   1358        896  
5 sd          34.8  169.       155. 

Why can't this be done with summarize() directly? That would seem more elegant than using a list of functions, as suggested by the colwise vignette. Does it violate the principles of a tidy data frame? (It seems to me that stacking a bunch of data frames on top of one another is far less tidy.)


Solution

  • Here is a way to do it using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but with less code.

    library(dplyr)
    library(purrr)
    
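    # lst() auto-names each element ("min", "median", ...); those names become the .id values
    funs <- lst(min, median, mean, max, sd)
    
    # Summarize once per function, then row-bind the one-row results
    map_dfr(funs,
            ~ summarize(starwars, across(where(is.numeric), .x, na.rm = TRUE)),
            .id = "statistic")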
    
    # # A tibble: 5 x 4
    #   statistic height   mass birth_year
    #   <chr>      <dbl>  <dbl>      <dbl>
    # 1 min         66     15          8  
    # 2 median     180     79         52  
    # 3 mean       174.    97.3       87.6
    # 4 max        264   1358        896  
    # 5 sd          34.8  169.       155.
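
  • A possible alternative, offered as a sketch rather than as part of the original answer: keep a single summarize() call with a named list of functions in across(), then reshape the one-row result with tidyr::pivot_longer(). The "__" separator and the names wide and result are illustrative choices. The separator only needs to be a string that does not occur in the column names (birth_year already contains a single underscore).

```r
library(dplyr)
library(tidyr)

# One summarize() call yields a single wide row:
# height__min, height__median, ..., birth_year__sd
wide <- summarize(
  starwars,
  across(
    where(is.numeric),
    list(
      min    = ~ min(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      mean   = ~ mean(.x, na.rm = TRUE),
      max    = ~ max(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE)
    ),
    .names = "{.col}__{.fn}"
  )
)

# ".value" turns the part before "__" into column names;
# the part after "__" becomes the row label in `statistic`
result <- pivot_longer(
  wide,
  everything(),
  names_to = c(".value", "statistic"),
  names_sep = "__"
)

result
```

    This should give the same layout as above, with one row per statistic and one column per variable. It does the work in one summarize() call, at the cost of an extra reshaping step.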