Search code examples
rfor-loopdplyrsummarize

Use summarize and a for loop taking column names from a character vector


I have a dataset which I cannot share here, but I need to create columns using a for loop and the column names should come from a character vector. Below I try to replicate what I am trying to achieve using the flights dataset from the nycflights13 package.

install.packages("nycflights13")
library(nycflights13)

flights <- nycflights13::flights
flights <- flights[c(10, 16, 17)]

var_interest <- c("distance", "hour")

for (i in 1:length(var_interest)) {
  flights %>% group_by(carrier) %>%
    summarize(paste(var_interest[i], "n", sep = "_") = sum(paste(var_interest[i])))
}

This code generates the following error:

Error: unexpected '=' in:
"  flights %>% group_by(carrier) %>%
    summarize(paste(var_interest[i], "n", sep = "_") ="
> }
Error: unexpected '}' in "}"

My actual dataset is more complex than this example and therefore, I need to follow this approach. So if you could help me find what I am missing here, that would be highly appreciated!


Solution

  • The code can be modified to evaluate (!!) the column after converting the string to symbol, while on the lhs of assignment (:=) do the evaluation (!!) of string as well

    out <- vector('list', length(var_interest))
    for (i in seq_along(var_interest)) {
    out[[i]] <- flights %>%
       group_by(carrier) %>%
       summarize(!! paste(var_interest[i], "n", sep = "_") := 
           sum(!! rlang::sym(var_interest[i])), .groups = 'drop')
     }
    
    
    lapply(out, head, 3)
    #[[1]]
    # A tibble: 3 x 2
    #  carrier distance_n
    #  <chr>        <dbl>
    #1 9E         9788152
    #2 AA        43864584
    #3 AS         1715028
    
    #[[2]]
    # A tibble: 3 x 2
    #  carrier hour_n
    #  <chr>    <dbl>
    #1 9E      266419
    #2 AA      413361
    #3 AS        9013
    

    There are multiple ways to pass a string column name and evaluate it.

    1. As above stated, convert to a symbol and evaluate (!!).
    2. Make use of across which can take either unquoted, or string or column index as integer i.e. In that case, we don't even need any loop

    flights %>%
          group_by(carrier) %>%
          summarise(across(all_of(var_interest), ~ 
                   sum(., na.rm = TRUE), .names = '{.col}_n'), 
                .groups = 'drop') 
    # A tibble: 16 x 3
    #   carrier distance_n hour_n
    #   <chr>        <dbl>  <dbl>
    # 1 9E         9788152 266419
    # 2 AA        43864584 413361
    # 3 AS         1715028   9013
    # 4 B6        58384137 747278
    # 5 DL        59507317 636932
    # 6 EV        30498951 718187
    # 7 F9         1109700   9441
    # 8 FL         2167344  43960
    # 9 HA         1704186   3324
    #10 MQ        15033955 358779
    #11 OO           16026    550
    #12 UA        89705524 754410
    #13 US        11365778 252595
    #14 VX        12902327  63876
    #15 WN        12229203 151366
    #16 YV          225395   9300