Search code examples
rdplyrrlang

mutate and/or summarise a dynamic number of columns


In a previous question I wanted to carry out case_when with a dynamic number of cases. The solution was to use parse_exprs along with !!!. I am looking for a similar solution to mutate/summarise with a dynamic number of columns.

Consider the following dataset.

library(dplyr)
library(rlang)
data(mtcars)
mtcars = mtcars %>%
  mutate(g2 = ifelse(gear == 2, 1, 0),
         g3 = ifelse(gear == 3, 1, 0),
         g4 = ifelse(gear == 4, 1, 0))

Suppose I want to sum the columns g2, g3, g4. If I know these are the columns names then this is simple, standard dplyr:

answer = mtcars %>%
  summarise(sum_g2 = sum(g2),
            sum_g3 = sum(g3),
            sum_g4 = sum(g4))

But suppose I do not know how many columns there are, or their exact names. Instead, I have a vector containing all the column names I care about. Following the logic in the accepted answer of my previous approach I would use:

columns_to_sum = c("g2","g3","g4")

formulas = paste0("sum_",columns_to_sum," = sum(",columns_to_sum,")")

answer = mtcars %>%
  summarise(!!!parse_exprs(formulas))

If this did work, then regardless of the column names provided as input in columns_to_sum, I should receive the sum of the corresponding columns. However, this is not working. Instead of a column named sum_g2 containing sum(g2) I get a column called "sum_g2 = sum(g2)" and every value in this column is a zero.

Given that I can pass formulas into case_when it seems like I should be able to pass formulas into summarise (and the same idea should also work for mutate because they all use the rlang package).

In the past there were string versions of mutate and summarise (mutate_ and summarise_) that you could pass formulas to as strings. But these have been retired as the rlang approach is the intended approach now. The related questions I reviewed on Stackoverflow did not use the rlang quotation approach and hence are not sufficient for my purposes.

How do I summarise with a dynamic number of columns (using an rlang approach)?


Solution

  • One option since dplyr 1.0.0 could be:

    mtcars %>%
     summarise(across(all_of(columns_to_sum), sum, .names = "sum_{col}"))
    
      sum_g2 sum_g3 sum_g4
    1      0     15     12