r, dplyr, split-apply-combine

How to add totals as well as group_by statistics in R


When computing statistics with summarise and group_by, we get one row per category but no row for the whole population (Total). How can I get both?

I am looking for something clean and short. So far I can only think of:

library(dplyr)

bind_rows(
  # Per-species statistics
  iris %>%
    group_by(Species) %>%
    summarise(
      Mean = mean(Sepal.Width),
      Median = median(Sepal.Width),
      sd = sd(Sepal.Width),
      p10 = quantile(Sepal.Width, probs = 0.1)
    ),
  # The same statistics over all rows, labelled "Total"
  iris %>%
    summarise(
      Mean = mean(Sepal.Width),
      Median = median(Sepal.Width),
      sd = sd(Sepal.Width),
      p10 = quantile(Sepal.Width, probs = 0.1)
    ) %>%
    mutate(Species = "Total")
)

But I would like something more compact. In particular, I don't want to type the summarise code twice, once for the groups and once for the total.


Solution

  • You can simplify this if you untangle what you're trying to do: iris contains several species, and you want each species summarized alongside a summary of all species together. You don't need to calculate the summary statistics before binding. Instead, bind iris to a copy of iris in which Species is set to "Total", then group and summarise once.

    library(tidyverse)
    
    bind_rows(
      iris,
      iris %>% mutate(Species = "Total")
    ) %>%
      group_by(Species) %>%
      summarise(Mean = mean(Sepal.Width),
                Median = median(Sepal.Width),
                sd = sd(Sepal.Width),
                p10 = quantile(Sepal.Width, probs = 0.1))
    #> # A tibble: 4 x 5
    #>   Species     Mean Median    sd   p10
    #>   <chr>      <dbl>  <dbl> <dbl> <dbl>
    #> 1 setosa      3.43    3.4 0.379  3   
    #> 2 Total       3.06    3   0.436  2.5 
    #> 3 versicolor  2.77    2.8 0.314  2.3 
    #> 4 virginica   2.97    3   0.322  2.59
    

    I like the caution in the comments above, though I do this sort of calculation for work often enough that I keep a similar shorthand function in a personal package. It perhaps makes less sense for things like standard deviations, but I need it a lot for adding up totals of demographic groups, etc. (If it's useful, that function is here).
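The shorthand function mentioned in the answer is not shown here, so the following is only a minimal sketch of what such a helper might look like. The name with_total and its arguments are hypothetical, not the answerer's actual package function. It assumes the grouping column is passed as a string and that converting it to character is acceptable.

```r
library(dplyr)

# Hypothetical helper (an illustration, not the answerer's package function):
# stacks the data on top of a copy whose grouping column is set to `label`,
# so a later group_by() + summarise() also produces an overall row.
with_total <- function(data, group_col, label = "Total") {
  # Convert to character so the label does not clash with existing factor levels
  data[[group_col]] <- as.character(data[[group_col]])
  total <- data
  total[[group_col]] <- label
  bind_rows(data, total)
}

result <- iris %>%
  with_total("Species") %>%
  group_by(Species) %>%
  summarise(
    Mean = mean(Sepal.Width),
    Median = median(Sepal.Width)
  ) %>%
  # Sort the "Total" row last instead of leaving it in alphabetical position
  arrange(Species == "Total")

print(result)
```

With this approach the summarise call is written only once. The trade-off is that the data is duplicated in memory before summarising, which is usually fine for small tables but may matter for large ones.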