Search code examples
rdplyrsummarysubtotal

dplyr summarize with subtotals


One of the great things about pivot tables in excel is that they provide subtotals automatically. First, I would like to know if there is anything already created within dplyr that can accomplish this. If not, what is the easiest way to achieve it?

In the example below, I show the mean displacement by number of cylinders and carburetors. For each group of cylinders (4,6,8), I'd like to see the mean displacement for the group (or total displacement, or any other summary statistic).

library(dplyr)
mtcars %>% group_by(cyl,carb) %>% summarize(mean(disp))

  cyl carb mean(disp)
1   4    1      91.38
2   4    2     116.60
3   6    1     241.50
4   6    4     163.80
5   6    6     145.00
6   8    2     345.50
7   8    3     275.80
8   8    4     405.50
9   8    8     301.00

Solution

  • data.table It's very clunky, but this is one way:

    library(data.table)
    DT <- data.table(mtcars)
    rbind(
      DT[,.(mean(disp)),          by=.(cyl,carb)],
      DT[,.(mean(disp), carb=NA), by=.(cyl) ],
      DT[,.(mean(disp), cyl=NA),  by=.(carb)]
    )[order(cyl,carb)]
    

    This gives

        cyl carb       V1
     1:   4    1  91.3800
     2:   4    2 116.6000
     3:   4   NA 105.1364
     4:   6    1 241.5000
     5:   6    4 163.8000
     6:   6    6 145.0000
     7:   6   NA 183.3143
     8:   8    2 345.5000
     9:   8    3 275.8000
    10:   8    4 405.5000
    11:   8    8 301.0000
    12:   8   NA 353.1000
    13:  NA    1 134.2714
    14:  NA    2 208.1600
    15:  NA    3 275.8000
    16:  NA    4 308.8200
    17:  NA    6 145.0000
    18:  NA    8 301.0000
    

    I'd rather see results in something like an R table, but don't know of any functions for that.


    dplyr @akrun found this analogous code

    bind_rows(
      mtcars %>% 
        group_by(cyl, carb) %>% 
        summarise(Mean= mean(disp)), 
      mtcars %>% 
        group_by(cyl) %>% 
        summarise(carb=NA, Mean=mean(disp)), 
      mtcars %>% 
        group_by(carb) %>% 
        summarise(cyl=NA, Mean=mean(disp))
    ) %>% arrange(cyl, carb)
    

    We could wrap the repeat operations in a function

    library(lazyeval)
    f1 <- function(df, grp, Var, func){
      FUN <- match.fun(func)
       df %>% 
         group_by_(.dots=grp) %>%
         summarise_(interp(~FUN(v), v=as.name(Var)))
      }
    
     m1 <- f1(mtcars, c('carb', 'cyl'), 'disp', 'mean')
     m2 <- f1(mtcars, 'carb', 'disp', 'mean')
     m3 <- f1(mtcars, 'cyl', 'disp', 'mean')
    
     bind_rows(list(m1, m2, m3)) %>%
                  arrange(cyl, carb) %>%
                  rename(Mean=`FUN(disp)`)
       carb cyl     Mean
    1     1   4  91.3800
    2     2   4 116.6000
    3    NA   4 105.1364
    4     1   6 241.5000
    5     4   6 163.8000
    6     6   6 145.0000
    7    NA   6 183.3143
    8     2   8 345.5000
    9     3   8 275.8000
    10    4   8 405.5000
    11    8   8 301.0000
    12   NA   8 353.1000
    13    1  NA 134.2714
    14    2  NA 208.1600
    15    3  NA 275.8000
    16    4  NA 308.8200
    17    6  NA 145.0000
    18    8  NA 301.0000
    

    Either option can be made a little less ugly with data.table's rbindlist with fill:

    rbindlist(list(
      mtcars %>% group_by(cyl) %>% summarise(mean(disp)),
      mtcars %>% group_by(carb) %>% summarise(mean(disp)),
      mtcars %>% group_by(cyl,carb) %>% summarise(mean(disp))
    ),fill=TRUE) %>% arrange(cyl,carb)
    
    rbindlist(list(
      DT[,mean(disp),by=.(cyl,carb)],
      DT[,mean(disp),by=.(cyl)],
      DT[,mean(disp),by=.(carb)]
    ),fill=TRUE)[order(cyl,carb)]