Search code examples
rdplyr

R: Use character vector with column prefixes to create new columns summing values across the prefixed columns


I have a large dataset with the following rough structure, where several columns share prefixes:

dataset <- data.frame(a1 =  c(1:10), a2 = c(11:20),b1=c(21:30),b2=c(31:40))

I want to create new columns with the row sum across each group of columns which share prefixes , so that it ends up looking like this

   a1 a2 b1 b2 a_sum b_sum
1   1 11 21 31    12    52
2   2 12 22 32    14    54
3   3 13 23 33    16    56
4   4 14 24 34    18    58
5   5 15 25 35    20    60
6   6 16 26 36    22    62
7   7 17 27 37    24    64
8   8 18 28 38    26    66
9   9 19 29 39    28    68
10 10 20 30 40    30    70

My preferred method would be somehow feed the character vector of prefixes vec <- c("a","b") into a mutate function which fits in a larger pipe workflow.

I have been wondering of ways to use across with matches, perhaps with some sort of lapply call to perform it for each element of my vector, but I fall short. I would prefer to avoid solutions which involve reshaping the data, as it is quite large and there are many columns which don't relate to this specific issue.


Solution

  • We can use map or lapply over the vec vector to create the desired one-column tibbles, and finally bind_cols() to bind them at the end of the original data.frame. To select columns matching a character prefix, we should usually use the starts_with selection helper. The walrus (:=) operator allows us to dynamically generate column names as desired inside tibble.

    library(dplyr)
    library(purrr)
    
    vec <- c("a","b")
    
    dataset |>
        bind_cols(vec |>
                      map(\(x) tibble("{x}_sum" := rowSums(
                          select(dataset, starts_with(x))
                      ))))
    
       a1 a2 b1 b2 a_sum b_sum
    1   1 11 21 31    12    52
    2   2 12 22 32    14    54
    3   3 13 23 33    16    56
    4   4 14 24 34    18    58
    5   5 15 25 35    20    60
    6   6 16 26 36    22    62
    7   7 17 27 37    24    64
    8   8 18 28 38    26    66
    9   9 19 29 39    28    68
    10 10 20 30 40    30    70
    

    Edit

    To enable upstream transformations to feed into this operation seamlessly, we can adapt the code to utilize the current data via pick rather than relying on external provision through select

    dataset |> 
        mutate(bind_cols(vec |>
                      map(\(x) tibble("{x}_sum" := rowSums(
                          pick(starts_with(x))
                      )))))