Any advice how to match strings, within a row, across multiple columns?
Adapted from Remove rows where all variables are NA using dplyr where they are matching only NAs across columns, and filtering those - not creating a new variable.
Toy example:
library(dplyr)
df <- tibble(a = c('a', 'a', 'a', NA),
b1 = c('b', 'c', NA, NA),
b2 = c('d', NA, NA, NA),
b3 = c('e', NA, NA, NA),
b4 = c('f', NA, NA, NA))
df
# A tibble: 4 x 5
a b1 b2 b3 b4
<chr> <chr> <chr> <chr> <chr>
1 a b d e f
2 a c NA NA NA
3 a NA NA NA NA
4 NA NA NA NA NA
To create a new variable all_na
if the whole row is NA:
df %>%
rowwise() %>%
mutate(all_na = all(is.na(across())))
# A tibble: 4 x 6
# Rowwise:
a b1 b2 b3 b4 all_na
<chr> <chr> <chr> <chr> <chr> <lgl>
1 a b d e f FALSE
2 a c NA NA NA FALSE
3 a NA NA NA NA FALSE
4 NA NA NA NA NA TRUE
To create a new variable if just a subset of the columns (starting with 'b') is NA b_is_na
df %>%
rowwise() %>%
mutate(b_is_na = all(is.na(across(starts_with('b'))))) %>%
ungroup()
# A tibble: 4 x 6
a b1 b2 b3 b4 b_is_na
<chr> <chr> <chr> <chr> <chr> <lgl>
1 a b d e f FALSE
2 a c NA NA NA FALSE
3 a NA NA NA NA TRUE
4 NA NA NA NA NA TRUE
However, I'm not sure how to create a variable if within a row, for a subset of columns is a string match OR NA
, for example, 'c' or NA
Desired output:
# A tibble: 4 x 6
a b1 b2 b3 b4 b_is_na
<chr> <chr> <chr> <chr> <chr> <lgl>
1 a b d e f FALSE
2 a c NA NA NA TRUE
3 a NA NA NA NA TRUE
4 NA NA NA NA NA TRUE
A base R
option and an efficient vectorized option would be rowSums
on a logical matrix
nm1 <- startsWith(names(df), 'b')
df$b_is_na <- rowSums(df[nm1] == 'c'|is.na(df[nm1])) > 0
df$b_is_na
#[1] FALSE TRUE TRUE TRUE
It can be also used with the mutate
library(dplyr)
df %>%
mutate(b_is_na = rowSums(select(., starts_with('b')) ==
'c'|is.na(select(., starts_with('b')))) > 0)
# A tibble: 4 x 6
# a b1 b2 b3 b4 b_is_na
# <chr> <chr> <chr> <chr> <chr> <lgl>
#1 a b d e f FALSE
#2 a c <NA> <NA> <NA> TRUE
#3 a <NA> <NA> <NA> <NA> TRUE
#4 <NA> <NA> <NA> <NA> <NA> TRUE
NOTE: Using rowwise
would be an inefficient way
Or with c_across
, but it may not be that optimal
df %>%
rowwise %>%
mutate(b_is_na = {
tmp <- c_across(starts_with('b'))
any(is.na(tmp)|tmp == 'c') }) %>%
ungroup
# A tibble: 4 x 6
# a b1 b2 b3 b4 b_is_na
# <chr> <chr> <chr> <chr> <chr> <lgl>
#1 a b d e f FALSE
#2 a c <NA> <NA> <NA> TRUE
#3 a <NA> <NA> <NA> <NA> TRUE
#4 <NA> <NA> <NA> <NA> <NA> TRUE