Search code examples
rdplyr

dplyr - summary table for multiple variables


How to create simple summary statistics using dplyr from multiple variables? Using the summarise_each function seems to be the way to go, however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.


Solution

  • Use dplyr in combination with tidyr to reshape the end result.

    library(dplyr)
    library(tidyr)
    
    df <- tbl_df(mtcars)
    
    df.sum <- df %>%
      select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
      summarise_each(funs(min = min, 
                          q25 = quantile(., 0.25), 
                          median = median, 
                          q75 = quantile(., 0.75), 
                          max = max,
                          mean = mean, 
                          sd = sd))
    
    # the result is a wide data frame
    > dim(df.sum)
    [1]  1 42
    
    # reshape it using tidyr functions
    
    df.stats.tidy <- df.sum %>% gather(stat, val) %>%
      separate(stat, into = c("var", "stat"), sep = "_") %>%
      spread(stat, val) %>%
      select(var, min, q25, median, q75, max, mean, sd) # reorder columns
    
    > print(df.stats.tidy)
    
       var  min    q25 median  q75  max     mean        sd
    1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
    2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
    3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
    4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
    5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
    6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161