Search code examples
rdplyrprefixsuffix

dplyr mutate based on other column with different suffix


I have a dataframe similar to this one:

df <- data.frame(a_1_1 = c(1, 0, 1), a_1_2=c(1,0,0),a_2_1=c(1,0,0), a_2_2=c(1,0 ,1), b=c(2,3,4))

I will like to create new variables by comparing the columns with the same prefix, in the following way:

df <- df %>% mutate(a_1=case_when((a_1_1==1 | a_1_2==1)~"A", TRUE ~ "B")) %>% 
mutate(a_2=case_when((a_2_1==1 | a_2_2==1)~"A", TRUE ~ "B"))

However in my real data, I have many variables starting with "a_*", so I will like to create these variables without doing for each variable once by once.


Solution

  • An option with across by looping over the columns that starts with 'a' followed by a _ and some digits, then _ and the number 1 at the end ($) of the string, use case_when on the that column (.) and the value returned (get) by the corresponding column by changing the column name (cur_column()) with str_replace, specify the suffix of the new column as _new, then in the next step rename those columns with rename_with

    library(dplyr)
    library(stringr)
    df %>% 
      mutate(across(matches('^a_\\d+_1$'), 
       ~ case_when(. == 1| get(str_replace(cur_column(), '_\\d+$', '_2')) == 1 ~ 'A',
          TRUE ~ 'B'), .names = '{.col}_new')) %>%
      rename_with(~ str_remove(., '_\\d+_new'), ends_with('new'))
    

    -output

    #  a_1_1 a_1_2 a_2_1 a_2_2 b a_1 a_2
    #1     1     1     1     1 2   A   A
    #2     0     0     0     0 3   B   B
    #3     1     0     0     1 4   A   A
    

    Or another option is to use pivot_longer to reshape into 'long' format and make it easier to do the comparison to create new columns, reshape it back to wide format with pivot_wider and then bind those new columns to original data

    library(tidyr)
    df %>%
      select(-b) %>% 
      mutate(rn = row_number()) %>%
      pivot_longer(cols = -rn, names_to = c('grp', '.value'),
          names_sep = "_(?=\\d+$)") %>% 
      transmute(rn, grp, val = case_when(`1` == 1|`2` == 1 ~ 'A',
           TRUE ~ 'B')) %>% 
      pivot_wider(names_from = grp, values_from = val) %>% 
      select(-rn) %>% 
      bind_cols(df, .)
    

    Or using base R with split.default

    df[paste0("a_", 1:2)] <- ifelse(
         sapply(split.default(df[startsWith(names(df), "a_")],  
         sub("_\\d+$", "", grep("^a_", names(df), value = TRUE))),
          rowSums) > 0, 'A', 'B')
    

    Or using a for loop

    nm1 <- unique(sub("_\\d+$", "", grep('^a_\\d+', names(df), value = TRUE)))
    for(nm in nm1) df[[nm]] <- ifelse(rowSums(df[startsWith(names(df), 
          nm)]) > 0, "A", "B")