Tags: r, regex, dplyr, mutate, across

How do I mutate across multiple columns that have similar names in R?


I have many columns whose names always start with one of the same prefixes: n_ for the number of students, score_ for the proportion of students who passed, and loc_ for the room number.

I want to multiply each n_ column by its respective score_ column (so n_math * score_math, n_sci * score_sci, etc.) and create new columns called n_*_success for the number of students who passed the class.

If I had just a few columns like in this sample dataset, I would do something like this for each column:

mutate(n_sci_success = n_sci * score_sci)

But I have many columns and I'd like to write some expression that will match column names.

I think I have to use regex and across (like across(starts_with("n_"))), but I just can't figure it out. Any help would be much appreciated!

Here's a sample dataset:

library(tidyverse)

test <- tibble(id = c(1:4),
               n_sci = c(10, 20, 30, 40),
               score_sci = c(1, .9, .75, .7),
               loc_sci = c(1, 2, 3, 4),
               n_math = c(100, 50, 40, 30),
               score_math = c(.5, .6, .7, .8),
               loc_math = c(4, 3, 2, 1),
               n_hist = c(10, 50, 30, 20),
               score_hist = c(.5, .5, .9, .9),
               loc_hist = c(2, 1, 4, 3))


Solution

  • Here's one way using across and the new pick function from dplyr 1.1.0:

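    library(dplyr)
    
    # Pair the n_ and score_ columns by position and multiply them element-wise.
    # across() supplies the new names via .names; pick() only selects columns.
    out <- test %>%
      mutate(across(starts_with('n_'), identity, .names = '{.col}_success') * 
               pick(starts_with('score_')))
    
    out %>% select(ends_with('_success'))
    
    #  n_sci_success n_math_success n_hist_success
    #          <dbl>          <dbl>          <dbl>
    #1          10               50              5
    #2          18               30             25
    #3          22.5             28             27
    #4          28               24             18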
    

    This should also work if you replace every pick with across, although calling across without a function is deprecated as of dplyr 1.1.0. pick is meant for selecting columns, whereas across is meant for applying a function to the selected columns.
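
    To make the difference between the two helpers concrete, here is a minimal sketch. The scores tibble and the new column names mean_score and *_pct are made up for this illustration and are not part of the question's data.

    ```r
    library(dplyr)

    # Hypothetical two-row data, used only for this illustration.
    scores <- tibble(score_sci = c(1, 0.9), score_math = c(0.5, 0.6))

    # pick() hands back the selected columns unchanged as a data frame,
    # which can then be passed to a function such as rowMeans().
    picked <- scores %>%
      mutate(mean_score = rowMeans(pick(starts_with("score_"))))

    # across() applies a function to each selected column;
    # .names controls what the resulting columns are called.
    scaled <- scores %>%
      mutate(across(starts_with("score_"), ~ .x * 100, .names = "{.col}_pct"))
    ```

    Here picked gains a single mean_score column, while scaled gains one *_pct column per selected score_ column.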

    I am using across in the first case (with starts_with('n_')) because I want to give unique names to the new columns using .names, an argument that pick does not have.
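
    The approach above pairs columns by position, so it relies on the n_ and score_ columns appearing in the same subject order. If that cannot be guaranteed, one possible alternative is to look up each partner column by name. The following is a sketch rather than a definitive solution: it uses the commonly seen idiom of combining cur_column() with get(), and it assumes every n_ column has a score_ column with the same suffix.

    ```r
    library(dplyr)

    # Sample data from the question.
    test <- tibble(id = 1:4,
                   n_sci = c(10, 20, 30, 40),
                   score_sci = c(1, .9, .75, .7),
                   loc_sci = c(1, 2, 3, 4),
                   n_math = c(100, 50, 40, 30),
                   score_math = c(.5, .6, .7, .8),
                   loc_math = c(4, 3, 2, 1),
                   n_hist = c(10, 50, 30, 20),
                   score_hist = c(.5, .5, .9, .9),
                   loc_hist = c(2, 1, 4, 3))

    out <- test %>%
      mutate(across(
        starts_with("n_"),
        # cur_column() is the name of the n_ column being processed;
        # swapping its prefix gives the name of the matching score_ column,
        # which get() then looks up among the data frame's columns.
        ~ .x * get(sub("^n_", "score_", cur_column())),
        .names = "{.col}_success"
      ))

    out %>% select(ends_with("_success"))
    ```

    Because the pairing is done by name, reordering the columns of test does not change the result.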