Tags: r, string, text-mining, stringr, grepl

Detect part of a string in R (not exact match)


Consider the following dataset:

a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")
c <- c("T", "T", "T", "F")
df <- data.frame(string1=a, string2=b, returns=c)

I'm trying to detect string1 in string2, but my goal is not only to detect exact matches. I'm looking for a way to detect the presence of the words of string1 in string2, whatever order the words appear in. For example, the string "my beautiful house is cool" should return TRUE when searching for "my house".

I have illustrated the expected behaviour of the script in the "returns" column of the example dataset above.

I have tried the grepl() and str_detect() functions, but they only work with an exact match. Can you please help? Thanks in advance.


Solution

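  • The trick is not to pass each search phrase to str_detect() as is, but to first split each element of search_words into individual words, which strsplit() does below. Each resulting word vector is then passed to str_detect() as the pattern, and all() checks that every word is matched in the corresponding string.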

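    library(stringr)
    search_words <- c("my house", "green", "the cat is", "a girl")
    words <- c("my beautiful house is cool", "the apple is green", "I m looking at the cat that is sleeping", "a boy")

    # Split each search phrase into its individual words (a list of character vectors)
    patterns <- strsplit(search_words, " ")

    # For each (sentence, word vector) pair, check that every word is found in the sentence
    mapply(function(sentence, pattern) all(str_detect(sentence, pattern)), words, patterns)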
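
One caveat: str_detect() treats each pattern as a regular expression and matches substrings, so a short word such as "a" is also found inside "cat". If whole-word matching is what you need, a minimal base R sketch could split both strings into words and compare them with %in%. The helper name contains_all_words and the detected column are hypothetical, and the sketch assumes words are separated by whitespace, that case should be ignored, and that there is no punctuation to strip.

```r
# Hypothetical helper: TRUE when every word of `needle` occurs
# as a whole word in `haystack` (case-insensitive, whitespace-separated).
contains_all_words <- function(needle, haystack) {
  needle_words   <- strsplit(trimws(tolower(needle)), "\\s+")[[1]]
  haystack_words <- strsplit(trimws(tolower(haystack)), "\\s+")[[1]]
  all(needle_words %in% haystack_words)
}

# Data from the question
a <- c("my house", "green", "the cat is", "a girl")
b <- c("my beautiful house is cool", "the apple is green",
       "I m looking at the cat that is sleeping", "a boy")
df <- data.frame(string1 = a, string2 = b, stringsAsFactors = FALSE)

# Apply the helper row by row; USE.NAMES = FALSE returns a plain logical vector
df$detected <- mapply(contains_all_words, df$string1, df$string2,
                      USE.NAMES = FALSE)
df$detected
```

For the example data this gives TRUE, TRUE, TRUE, FALSE, which matches the expected column in the question. Unlike the substring approach, contains_all_words("a", "the cat") returns FALSE.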