Tags: r, dplyr, rowwise, across

Mutate, row-wise, based on matching strings or NA across subset of columns


Any advice on how to match strings within a row, across multiple columns?

This is adapted from "Remove rows where all variables are NA using dplyr". That question matches only NAs across columns and filters out those rows; it does not create a new variable.

Toy example:

library(dplyr)
df <- tibble(a = c('a', 'a', 'a', NA), 
             b1 = c('b', 'c', NA, NA), 
             b2 = c('d', NA, NA, NA),
             b3 = c('e', NA, NA, NA),
             b4 = c('f', NA, NA, NA))
df

# A tibble: 4 x 5
  a     b1    b2    b3    b4   
  <chr> <chr> <chr> <chr> <chr>
1 a     b     d     e     f    
2 a     c     NA    NA    NA   
3 a     NA    NA    NA    NA   
4 NA    NA    NA    NA    NA 

To create a new variable, all_na, that is TRUE if the whole row is NA:

df %>% 
  rowwise() %>% 
  mutate(all_na = all(is.na(across())))


# A tibble: 4 x 6
# Rowwise: 
  a     b1    b2    b3    b4    all_na
  <chr> <chr> <chr> <chr> <chr> <lgl> 
1 a     b     d     e     f     FALSE 
2 a     c     NA    NA    NA    FALSE 
3 a     NA    NA    NA    NA    FALSE 
4 NA    NA    NA    NA    NA    TRUE   

To create a new variable, b_is_na, that is TRUE if every column in a subset (those starting with 'b') is NA:

df %>% 
  rowwise() %>% 
  mutate(b_is_na = all(is.na(across(starts_with('b'))))) %>% 
  ungroup()

# A tibble: 4 x 6
  a     b1    b2    b3    b4    b_is_na
  <chr> <chr> <chr> <chr> <chr> <lgl>  
1 a     b     d     e     f     FALSE  
2 a     c     NA    NA    NA    FALSE  
3 a     NA    NA    NA    NA    TRUE   
4 NA    NA    NA    NA    NA    TRUE   

Question:

However, I'm not sure how to create a variable that flags a row when, for a subset of columns, there is a string match OR an NA, for example 'c' or NA.

Desired output:

# A tibble: 4 x 6
  a     b1    b2    b3    b4    b_is_na
  <chr> <chr> <chr> <chr> <chr> <lgl>  
1 a     b     d     e     f     FALSE  
2 a     c     NA    NA    NA    TRUE  
3 a     NA    NA    NA    NA    TRUE   
4 NA    NA    NA    NA    NA    TRUE   

Solution

  • A base R option, which is also efficient and vectorized, is rowSums on a logical matrix:

    nm1 <- startsWith(names(df), 'b')
    df$b_is_na <- rowSums(df[nm1] == 'c'|is.na(df[nm1])) > 0
    df$b_is_na
    #[1] FALSE  TRUE  TRUE  TRUE
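
    The reason no NA leaks into the result can be sketched as follows. The comparison with 'c' yields NA for missing cells, but OR-ing with is.na() turns each of those into TRUE, so rowSums only ever sees TRUE/FALSE. The `toy` data frame and the intermediate names are hypothetical, introduced only for illustration.

    ```r
    # Hypothetical two-column frame, used only to illustrate the NA handling.
    toy <- data.frame(b1 = c('b', 'c', NA), b2 = c('d', NA, NA))

    is_c  <- toy == 'c'     # logical matrix; NA wherever the cell is NA
    is_na <- is.na(toy)     # logical matrix; TRUE wherever the cell is NA
    hit   <- is_c | is_na   # NA | TRUE evaluates to TRUE, so no NA survives

    # A row is flagged when at least one cell is 'c' or NA.
    flag <- rowSums(hit) > 0
    ```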
    

    It can also be used within mutate:

    library(dplyr)
    df %>%
      mutate(b_is_na = rowSums(select(., starts_with('b')) == 
                 'c'|is.na(select(., starts_with('b')))) > 0)
    # A tibble: 4 x 6
    #  a     b1    b2    b3    b4    b_is_na
    #  <chr> <chr> <chr> <chr> <chr> <lgl>  
    #1 a     b     d     e     f     FALSE  
    #2 a     c     <NA>  <NA>  <NA>  TRUE   
    #3 a     <NA>  <NA>  <NA>  <NA>  TRUE   
    #4 <NA>  <NA>  <NA>  <NA>  <NA>  TRUE 
    

    NOTE: Using rowwise is an inefficient way to do this.

    It can also be done with c_across, though this is likely to be slower:

    df %>% 
      rowwise() %>%
      mutate(b_is_na = {
        # collect this row's 'b' columns into a single character vector
        tmp <- c_across(starts_with('b'))
        any(is.na(tmp) | tmp == 'c')
      }) %>%
      ungroup()
    # A tibble: 4 x 6
    #  a     b1    b2    b3    b4    b_is_na
    #  <chr> <chr> <chr> <chr> <chr> <lgl>  
    #1 a     b     d     e     f     FALSE  
    #2 a     c     <NA>  <NA>  <NA>  TRUE   
    #3 a     <NA>  <NA>  <NA>  <NA>  TRUE   
    #4 <NA>  <NA>  <NA>  <NA>  <NA>  TRUE
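
Another vectorized possibility, not part of the answer above, is if_any(), assuming dplyr 1.0.4 or later is installed. It applies the predicate to each selected column and ORs the results row by row, so rowwise is not needed. This is a minimal sketch on the question's toy data; the name `res` is introduced here for illustration.

```r
library(dplyr)

# Toy data from the question, repeated so the sketch is self-contained.
df <- tibble(a  = c('a', 'a', 'a', NA),
             b1 = c('b', 'c', NA, NA),
             b2 = c('d', NA, NA, NA),
             b3 = c('e', NA, NA, NA),
             b4 = c('f', NA, NA, NA))

# Flag rows where any 'b' column is NA or equal to 'c'.
# is.na(.x) | .x == 'c' never yields NA, because NA | TRUE is TRUE.
res <- df %>%
  mutate(b_is_na = if_any(starts_with('b'), ~ is.na(.x) | .x == 'c'))

res$b_is_na
#[1] FALSE  TRUE  TRUE  TRUE
```

The same predicate with if_all() would instead require every 'b' column in the row to be 'c' or NA.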