I get a weird data structure when grouping by several columns and summarizing several columns in dplyr. My real data frame is large, and the weirdness of the resulting data structure is more pronounced there, but below I create a small version of the problem.
With a single grouping column, everything is fine:
library(dplyr)
df <- data.frame(A = c(1,1,2,2), B = c(1,1,2,2), C = c(10,20,30,40), D = c(1000,2000,3000,4000))
df %>% group_by(A) %>% summarize(C = sum(C),D = sum(D)) %>% str()
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
$ A: num 1 2
$ C: num 30 70
$ D: num 3000 7000
With two grouping columns, the result has an extra class and extra attributes. What is this?
df %>% group_by(A,B) %>% summarize(C = sum(C),D = sum(D)) %>% str()
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 4 variables:
$ A: num 1 2
$ B: num 1 2
$ C: num 30 70
$ D: num 3000 7000
- attr(*, "vars")=List of 1
..$ : symbol A
- attr(*, "drop")= logi TRUE
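The group_by creates some additional attributes. summarise drops only the last level of grouping, so after group_by(A, B) the result is still grouped by A; that is where the extra ‘grouped_df’ class and the "vars" and "drop" attributes come from. With a single grouping column there is nothing left to group by, so the result is a plain ‘tbl_df’. If we don't need those attributes, then calling ungroup after summarise is one option: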
df %>%
  group_by(A, B) %>%
  summarize(C = sum(C), D = sum(D)) %>%
  ungroup() %>%
  str()
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 4 variables:
# $ A: num 1 2
# $ B: num 1 2
# $ C: num 30 70
# $ D: num 3000 7000
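As a possible alternative, here is a minimal sketch that assumes dplyr 1.0.0 or later, where summarise accepts a .groups argument (the output shown in the question looks like it comes from an older version, where this argument is not available). It uses group_vars and is_grouped_df to check which grouping is left, and .groups = "drop" to get an ungrouped result without a separate ungroup call. The names res_grouped and res_flat are only illustrative.

```r
library(dplyr)

# Same toy data as in the question
df <- data.frame(A = c(1, 1, 2, 2), B = c(1, 1, 2, 2),
                 C = c(10, 20, 30, 40), D = c(1000, 2000, 3000, 4000))

# "drop_last" spells out the default for this case: only the last
# grouping column (B) is dropped, so the result stays grouped by A
res_grouped <- df %>%
  group_by(A, B) %>%
  summarise(C = sum(C), D = sum(D), .groups = "drop_last")

group_vars(res_grouped)     # "A"
is_grouped_df(res_grouped)  # TRUE

# "drop" removes all grouping, equivalent to calling ungroup() afterwards
res_flat <- df %>%
  group_by(A, B) %>%
  summarise(C = sum(C), D = sum(D), .groups = "drop")

is_grouped_df(res_flat)     # FALSE
```

Either way the summarised values are the same; only the leftover grouping metadata differs, which matters if later steps such as mutate or another summarise should run over the whole table rather than per group.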