r stringr sapply

R function for finding at least 2 words that match between 2 strings (applied over 2 vectors of strings)?


I have two sets of strings, Char and Char2 in this example. I am trying to find out whether each string in Char includes at least 2 words from the corresponding string in Char2 (any two words can match). I have not yet reached the "at least 2 words" part, because I must first figure out how to match any word in each string. Any help would be greatly appreciated.

I have tried using the stringr package in a couple of different ways; please see below. I tried solutions similar to Robert's answer to this question: Detect multiple strings with dplyr and stringr

library(stringr)

shop <- data.frame(
  Char  = c("good apples", "bag of apples", "bag of sugar", "milk x2"),
  Char2 = c("good pears", "bag of sugar", "bag of flour", "sour milk x2"),
  stringsAsFactors = FALSE
)


# First attempt
sapply(shop$Char, function(x) any(sapply(shop$Char2, str_detect, string = x)))

# Second attempt
str_detect(shop$Char, paste(shop$Char2, collapse = '|'))

I get these results:

sapply(shop$Char, function(x) any(sapply(shop$Char2, str_detect, string = x)))
  good apples bag of apples  bag of sugar       milk x2 
        FALSE         FALSE          TRUE         FALSE 


str_detect(shop$Char, paste(shop$Char2, collapse = '|'))
FALSE FALSE  TRUE FALSE

However I am looking for these results:

FALSE TRUE TRUE TRUE

1) FALSE because only 1 word matches
2) TRUE because "bag of" in both
3) TRUE because "bag of" in both
4) TRUE because "milk x2" in both


Solution

  • Here is a function that could help

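    match_test <- function (string1, string2) {
      # Split each string into words on single spaces
      words1 <- unlist(strsplit(string1, ' '))
      words2 <- unlist(strsplit(string2, ' '))
      # intersect() returns the distinct words present in both strings
      common_words <- intersect(words1, words2)
      # TRUE when at least 2 distinct words are shared
      length(common_words) > 1
    }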
    

    Here is an example

    string1 <- c("good apples" , "bag of apples", "bag of sugar", "milk x2")
    string2 <- c("good pears" , "bag of sugar", "bag of flour", "sour milk x2")
    vapply(seq_along(string1), function (k) match_test(string1[k], string2[k]), logical(1))
    # [1] FALSE  TRUE  TRUE  TRUE
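
The same idea can be applied straight to the two columns of the shop data frame from the question. The sketch below is only one possible variant. The helper name count_common_words and the columns n_common and match are made up for illustration. Lower-casing and splitting on runs of whitespace are assumptions about what should count as the same word; neither appears in the original answer.

```r
# Hypothetical helper (not part of the original answer): returns the number
# of distinct words that two strings share.
count_common_words <- function(string1, string2) {
  # Assumption: matching is case-insensitive and words are separated by
  # one or more whitespace characters.
  words1 <- strsplit(tolower(trimws(string1)), "\\s+")[[1]]
  words2 <- strsplit(tolower(trimws(string2)), "\\s+")[[1]]
  length(intersect(words1, words2))
}

shop <- data.frame(
  Char  = c("good apples", "bag of apples", "bag of sugar", "milk x2"),
  Char2 = c("good pears", "bag of sugar", "bag of flour", "sour milk x2"),
  stringsAsFactors = FALSE
)

# mapply() walks both columns in parallel, one row at a time.
shop$n_common <- mapply(count_common_words, shop$Char, shop$Char2,
                        USE.NAMES = FALSE)

# "At least 2 words" becomes a simple threshold on the count.
shop$match <- shop$n_common >= 2
shop$match
# [1] FALSE  TRUE  TRUE  TRUE
```

Keeping the count in its own column makes it easy to change the threshold later without re-splitting the strings.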