Search code examples
rsapplygrepl

Row-wise find substrings from one column in other column (combine grepl, strsplit and *apply)


I want to know, whether a string in one column exists in a range of columns. The search column can contain several strings, separated by ", ". I call them "search terms". I don't care if one or several terms are found, but I need to know if there are duplicates. Here is some mock-data

df <- data.frame(
  a=c("a","b","c, d","d"), 
  b=c(NA, "k", NA,"k"), 
  c=c("c1","c2","c3","c4, c5"), 
  search_terms=c("a",NA,"c, a","a, c5"))
df

     a    b      c search_terms
1    a <NA>     c1            a
2    b    k     c2         <NA>
3 c, d <NA>     c3         c, a
4    d    k c4, c5        a, c5

What I would like as a result is:

  test
1 search term found in a_c
2 <NA>
3 search term found in a_c
4 search term found in a_c

explainer:

  1. Search term "a" is in column a
  2. There is no search term
  3. Search term "c" is in column a
  4. Search term "c5" is in column c

So its possible to search for a string in all substrings of the search column. The following code correctly identifies "c5" in row 4. But I fail to do it a rowwise match.

df %>% mutate(test=ifelse(sapply(strsplit(df$search_terms, ", "), 
                                 function(x) {any(x == "c5")}),
                          "search term found in a_c",NA)) %>%
  select(test)

                test
1               <NA>
2               <NA>
3               <NA>
4 search term found in a_c

I further manage to check the existence rowwise, but not when the input is a list of strings. This code correctly identifies the first match, but neither the third nor the fourth.

df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>% 
  mutate(test=ifelse(mapply(grepl, search_terms,a_c),
                      "search term found in a_c",NA))%>%
  select(test)

                      test
1 search term found in a_c
2                     <NA>
3                     <NA>
4                     <NA>

I hoped to combine the two along the following lines, but failed with grepl only taking the first element. So it correctly identifies the first and the third match, but fails to identify the match in row 4. So why does the any-command not work here, but in the first line of code?

df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>% 
  mutate(test=ifelse(apply(.,1,function(x) {
    sapply(strsplit(x["search_terms"],", "), function(y) {
      any(grepl(y,x["a_c"]))
      })
    }),"search_term in a_c",NA)
    ) %>%
  select(test)


                test
1 search term found in a_c
2               <NA>
3 search term found in a_c
4               <NA>

Warning messages:
1: Problem while computing `test = ifelse(...)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used 
2: Problem while computing `test = ifelse(...)`.
ℹ argument 'pattern' has length > 1 and only the first element will be used 

Solution

  • There was an answer which now seems deleted. It didn't work as I wanted but gave some crucial insights for the following solution:

    library(dplyr)
    df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
     mutate(test=ifelse(apply(.,1,function(x) {
       sapply(strsplit(x["search_terms"],", "), function(y) {
         any(sapply(y, function(z) grepl(z,x["a_c"])))
       })
     }),"search_term in a_c",NA)
     ) %>%
     select(test)
    

    The thing is that strsplit returns a list and hence requires a sapply command within the any function. As I understand the functioning of those nested *apply commands:

    1. apply(.,… makes sure the following function(x) is applied on every row of df
    2. sapply(strsplit(applies strsplit and outputs the list of search terms. For every list of search terms, the function(y) is applied
    3. any(sapply( applies the grepl function on every it of the list of search terms.

    However, I don't understand the logic fully and have the impression there is an easier way to solve that, with less *apply-functions. I further could imagine that there could be a cleaner tidyr-approach. However, the function gives me the desired output (here, with a slightly more complex df).

    df <- data.frame(
      a=c("a","b","c, d","d", "e"),
      b=c(NA, "x", NA,"y", NA),
      c=c("c1","c2","c3","c4, c5", "c5, c6"),
      search_terms=c("a",NA,"c, a","x", "l, c6"),stringsAsFactors = F)
    
    library(dplyr)
    df %>% tidyr::unite(a_c,a:c, na.rm = TRUE, remove=F,sep = ',') %>%
      mutate(test=ifelse(apply(.,1,function(x) {
        sapply(strsplit(x["search_terms"],", "), function(y) {
          any(sapply(y, function(z) grepl(z,x["a_c"])))
        })
      }),"search_term in a_c",NA)
      ) %>%
      select(-a_c)
    
         a    b      c search_terms               test
    1    a <NA>     c1            a search_term in a_c
    2    b    x     c2         <NA>               <NA>
    3 c, d <NA>     c3         c, a search_term in a_c
    4    d    y c4, c5            x               <NA>
    5    e <NA> c5, c6        l, c6 search_term in a_c