Search code examples
rstringdetect

Find if values across multiple columns are present within a string


I am trying to determine whether drug names which are spread across multiple columns, are present within a string. I am using str_detect, and eventually need to refer to column indices as I don't know how many columns I need to use to search.

Example data:

string.to.search<-c("abc","bc","cd","e","f")
Drug1<-c("b","c","e","f",NA)
Drug2<-c("c","d",NA,NA,NA)
Drug3<-c(NA,NA,NA,NA,NA)

df<-as.data.frame(cbind(string.to.search, Drug1, Drug2, Drug3))

I can do this easily enough when referring to the columns I wish to use as search terms (Drug1 to Drug3 in this instance), and achieve the desired result:

df.new<-df%>%
  mutate(Found = str_detect(string.to.search, paste(Drug1, Drug2, Drug3, sep="|")))
String.to.search Drug1 Drug2 Drug3 Found
a b c b c TRUE
b c c d TRUE
c d e FALSE
e f FALSE
f FALSE

I have found it is possible to use a column index, but can only make this work for a single column:

df.new2<-df%>%
  mutate(Found = str_detect(string.to.search, .[[2]]))
String.to.search Drug1 Drug2 Drug3 Found
a b c b c TRUE
b c c d TRUE
c d e FALSE
e f FALSE
f

I can't figure out how to make this work across multiple column indices so that I can go from the second to the last column (I might have 1 drug or I might have 10 in future). When I try this I get the following error:

df.new3<-df%>%
  mutate(Found = str_detect(string.to.search, .[[2:ncol(df)]]))

Error: Problem with mutate() column Found. i Found = str_detect(string.to.search, .[[2:ncol(df)]]). x recursive indexing failed at level 2

When I just look at what is returned by df[2:ncol(df)], only the three Drug columns are returned as expected, so I feel like I must be using the correct indices.

Any help greatly appreciated.

  • ps I have tried similar code with grepl and get the same results, but I am happy to use that instead of str_detect if that's more logical.

Solution

  • If your string.to.search is always space-delimited (or something easily split), then we can use %in% instead of string-matches; the latter can provide false-positives if you have sub-string matches.

    df <- structure(list(string.to.search = c("a b c", "b c", "c d", "e", "f"), Drug1 = c("b", "c", "e", "f", NA), Drug2 = c("c", "d", NA, NA, NA), Drug3 = c(NA_character_, NA_character_, NA_character_, NA_character_, NA_character_)), row.names = c(NA, -5L), class = "data.frame")
    df
    #   string.to.search Drug1 Drug2 Drug3
    # 1            a b c     b     c  <NA>
    # 2              b c     c     d  <NA>
    # 3              c d     e  <NA>  <NA>
    # 4                e     f  <NA>  <NA>
    # 5                f  <NA>  <NA>  <NA>
    
    df %>%
      mutate(
        Found = Reduce(`|`,
          lapply(subset(., select=Drug1:Drug3),
                 function(z) mapply(`%in%`, z, strsplit(string.to.search, " +"))))
      )
    #   string.to.search Drug1 Drug2 Drug3 Found
    # 1            a b c     b     c  <NA>  TRUE
    # 2              b c     c     d  <NA>  TRUE
    # 3              c d     e  <NA>  <NA> FALSE
    # 4                e     f  <NA>  <NA> FALSE
    # 5                f  <NA>  <NA>  <NA> FALSE