Search code examples
rloopssum

how to sum the value for multiple variables by the same group in r


I need to sum the values for about 40 variables by the same group.

This is an example dataset. So I wanted to sum the values of score1-score5 by region and department.

region <- rep(c("south", "east", "west", "north"),times=10)
department <- rep(c("A",  "B","C","D","E"),times=8)
score1 <- rnorm(n = 40, mean = 0, sd = 1)
score2 <-rnorm(n = 40, mean = 3, sd = 1.5)
score3 <-rnorm(n = 40, mean = 2, sd = 1)
score4 <-rnorm(n = 40, mean = 1, sd = 1.5)
score5 <-rnorm(n = 40, mean = 5, sd = 1.5)
df <- data.frame(region, department, score1, score2, score3, score4, score5)

This is the code that would lead to the resutls I wanted to have but is there any easier ways to do this:

df %>% group_by(region, department) %>%
       summarise(score1=sum(score1), 
                score2=sum(score2),
                score3=sum(score3),
                score4=sum(score4),
                score5=sum(score5))

I tried to use a loop but this didn't work:

vlist<-c("score1", "score2", "score3", "score4", "score5")
for (var in vlist) {
  df<-df %>% group_by(region, department) %>%
    summarise(var=sum(.[[var]]))
}

Is there any other ways or what is wrong with my loop? Thanks!


Solution

  • Use across - loop across the columns that starts_with 'score' and get the sum

    library(dplyr)
    out1 <- df %>% 
       group_by(region, department) %>% 
       summarise(across(starts_with('score'), sum), .groups = 'drop')
    

    In the for loop, the issue is that df is getting updated (df <-..) in each iteration and summarise returns only the columns provided in the group by and the summarised output. Thus, after the first iteration, 'df' wouldn't have the 'score' columns at all. If we want to use a for loop, get the output in a list and then reduce with a join

    library(purrr) 
    out_list <- vector('list', length(vlist))
    names(out_list) <- vlist
    for (var in vlist) {
        out_list[[var]] <- df %>% 
                group_by(region, department) %>%
                summarise(!!var := sum(cur_data()[[var]]), .groups = 'drop')
       }
    out2 <- reduce(out_list, full_join, by = c('region', 'department'))
    

    -checking the outputs

    > identical(out1, out2)
    [1] TRUE