
R: dplyr gives strange data structure only when grouping by more than one column


I get a weird data structure when grouping by several columns and summarizing several columns in dplyr. The real data frame is large and the weirdness of the resulting data structure is more pronounced, but below I create a small version of the problem.

When grouping by a single column, everything is fine:

library(dplyr)
df <- data.frame(A = c(1,1,2,2), B = c(1,1,2,2), C = c(10,20,30,40), D = c(1000,2000,3000,4000))
df %>% group_by(A) %>% summarize(C = sum(C),D = sum(D)) %>% str()
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2 obs. of  3 variables:
 $ A: num  1 2
 $ C: num  30 70
 $ D: num  3000 7000

But when grouping by two columns, what is this?

df %>% group_by(A,B) %>% summarize(C = sum(C),D = sum(D)) %>% str()
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  4 variables:
 $ A: num  1 2
 $ B: num  1 2
 $ C: num  30 70
 $ D: num  3000 7000
 - attr(*, "vars")=List of 1
  ..$ : symbol A
 - attr(*, "drop")= logi TRUE

Solution

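  • group_by() adds grouping attributes, and summarise() removes only the last grouping variable. After grouping by A and B, the result is therefore still grouped by A, which is why it keeps the grouped_df class and the "vars" attribute. With a single grouping column nothing is left to group by, so the result is a plain tbl_df. If we don't need the remaining grouping, calling ungroup() after summarise() is one option: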

    df %>%
      group_by(A, B) %>%
      summarize(C = sum(C), D = sum(D)) %>%
      ungroup() %>%
      str()
    #Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2 obs. of  4 variables:
    # $ A: num  1 2
    # $ B: num  1 2
    # $ C: num  30 70
    # $ D: num  3000 7000
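
The leftover grouping can also be inspected directly rather than read off the str() output. The following is a minimal sketch, reusing the df from the question. The object names still_grouped, ungrouped and dropped are made up for illustration, and the last step assumes dplyr 1.0.0 or later, where summarize() accepts a .groups argument. On older versions only the ungroup() route applies.

```r
library(dplyr)

df <- data.frame(A = c(1, 1, 2, 2), B = c(1, 1, 2, 2),
                 C = c(10, 20, 30, 40), D = c(1000, 2000, 3000, 4000))

# Default behaviour: summarize() removes only the last grouping variable (B)
still_grouped <- df %>%
  group_by(A, B) %>%
  summarize(C = sum(C), D = sum(D))
group_vars(still_grouped)     # "A": the result is still grouped by A

# Explicit ungroup(), as in the answer above
ungrouped <- still_grouped %>% ungroup()
is_grouped_df(ungrouped)      # FALSE

# Assumption: dplyr >= 1.0.0. Drop all grouping inside summarize() itself
dropped <- df %>%
  group_by(A, B) %>%
  summarize(C = sum(C), D = sum(D), .groups = "drop")
is_grouped_df(dropped)        # FALSE
```

Both routes should give the same ungrouped result. Which one to use is mostly a matter of taste and of which dplyr version is installed.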