Search code examples
rgroup-bysummarize

How to sum a set of columns grouped by one column


I have a dataframe like so

ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)     
cluster <- c(1,2,3,1)

test <- data.frame(ID, Type1, Type2, cluster)

I want to group by cluster and sum the values in all the other columns apart from ID that should be dropped.

I achieved it through

test.sum <- test %>%
  group_by(cluster)%>%
  summarise(sum(Type1), sum(Type2)) 

However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?


Solution

  • This is whereacross() and contains comes in incredibly useful to select the columns you want to summarise across:

    test %>% 
        group_by(cluster) %>% 
        summarise(across(contains("Type"), sum))
    
      cluster Type1 Type2
        <dbl> <dbl> <dbl>
    1       1     2     0
    2       2     1     1
    3       3     0     1
    

    Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:

    library(dplyr)
    library(tidyr)
    
    test %>% 
        pivot_longer(-c(ID, cluster)) %>% 
        group_by(cluster, name) %>% 
        summarise(sum_value = sum(value)) %>% 
        pivot_wider(names_from = "name", values_from = "sum_value")
    
      cluster Type1 Type2
        <dbl> <dbl> <dbl>
    1       1     2     0
    2       2     1     1
    3       3     0     1