r tm stringr

Count the number of matching words in a phrase


I have two big lists of phrases. For each phrase in the first list, I need to compute the percentage of its words that appear in each phrase of the other list, and pick the best-matching phrase from that list.

A <- data.frame(name = c(
  "X-ray right leg arteries",
  "x-ray left shoulder",
  "x-ray leg arteries",
  "x-ray leg with 20km distance"
), stringsAsFactors = FALSE)

B <- data.frame(name = c(
  "X-ray left leg arteries",
  "X-ray leg",
  "xray right leg",
  "X-ray right leg arteries"
), stringsAsFactors = FALSE)

fuzzy_prep_words <- function(words) {
  words <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", words)), "\\W+"))
  return(words)
}

fuzzy_prep_words(A$name)
fuzzy_prep_words(B$name)

I am able to extract the words from each list, but I am not able to calculate the number and proportion of words matched in the other list.

"X-ray right leg arteries" has an exact match in B, so it should return two columns: Match = "X-ray right leg arteries" and Distance = 100%. For the second phrase, "x-ray left shoulder", it should return the match "X-ray left leg arteries" and a distance of 66.67%, since 2 of the 3 words in "x-ray left shoulder" are matched. For the third phrase, it should return either "X-ray left leg arteries" or "X-ray right leg arteries".

I have already explored string distance algorithms such as LV, cosine, and LCS, and I don't want to use them because the phrases in my real dataset are long.


Solution

  • How about something like this?

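    # For each phrase in A (lower-cased, punctuation removed, split into words) ...
    m <- lapply(strsplit(tolower(gsub("[[:punct:]]", "", A$name)), " "), function(w1)
        # ... compare against every phrase in B and stack the results row-wise.
        do.call(rbind.data.frame, lapply(strsplit(tolower(gsub("[[:punct:]]", "", B$name)), " "), function(w2) {
            cbind.data.frame(
                matches_string_from_B = paste(w2, collapse = " "),
                # Share of words from the A phrase that also occur in the B phrase.
                percentage = sum(w1 %in% w2) / length(w1) * 100)
            }
        ))
    )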
    names(m) <- tolower(gsub("[[:punct:]]", "", A$name))

    m
    $`xray right leg arteries`
        matches_string_from_B percentage
    1  xray left leg arteries         75
    2                xray leg         50
    3          xray right leg         75
    4 xray right leg arteries        100
    
    $`xray left shoulder`
        matches_string_from_B percentage
    1  xray left leg arteries   66.66667
    2                xray leg   33.33333
    3          xray right leg   33.33333
    4 xray right leg arteries   33.33333
    
    $`xray leg arteries`
        matches_string_from_B percentage
    1  xray left leg arteries  100.00000
    2                xray leg   66.66667
    3          xray right leg   66.66667
    4 xray right leg arteries  100.00000
    
    $`xray leg with 20km distance`
        matches_string_from_B percentage
    1  xray left leg arteries         40
    2                xray leg         40
    3          xray right leg         40
    4 xray right leg arteries         40
    

    Explanation: Split each entry of A$name into words, calculate the percentage of those words that also occur in each (split) entry of B$name, and store the results in a list of data frames, one per entry of A$name. tolower and gsub("[[:punct:]]", "", ...) make the matching case-insensitive and ignore punctuation characters.
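To make the percentage calculation easier to follow, here is a minimal sketch of the same computation for a single pair of phrases. The helper name `word_overlap` is hypothetical (it is not part of the answer's code), and the sketch assumes words are separated by single spaces, as in the example data.

```r
# Hypothetical helper: percentage of the words in `a` that also occur in `b`.
# Uses the same cleaning steps as above (strip punctuation, lower-case, split on spaces).
word_overlap <- function(a, b) {
  prep <- function(x) strsplit(tolower(gsub("[[:punct:]]", "", x)), " ")[[1]]
  w1 <- prep(a)
  w2 <- prep(b)
  sum(w1 %in% w2) / length(w1) * 100
}

# 2 of the 3 words in the first phrase ("xray", "left") occur in the second one.
word_overlap("x-ray left shoulder", "X-ray left leg arteries")
```

Note that the measure is not symmetric: the denominator is the word count of the first phrase, so swapping the arguments can give a different percentage.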

    Update

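    To get the best match (percentage-wise) for each phrase, you can do the following. Note that which.max returns the first maximum, so ties are resolved in favour of the earliest entry in B: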

    do.call(rbind.data.frame, lapply(m, function(x) x[which.max(x$percentage), ]))
    #                              matches_string_from_B percentage
    #xray right leg arteries     xray right leg arteries  100.00000
    #xray left shoulder           xray left leg arteries   66.66667
    #xray leg arteries            xray left leg arteries  100.00000
    #xray leg with 20km distance  xray left leg arteries   40.00000
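If you would rather get the original (uncleaned) phrases back in one data frame, the two steps can be combined. The following is only a sketch under the same assumptions as above; `prep_words` and `best_match` are hypothetical names, and the phrases in B are tokenised once up front so the work is not repeated for every phrase in A.

```r
# Example data from the question.
A <- data.frame(name = c("X-ray right leg arteries", "x-ray left shoulder",
                         "x-ray leg arteries", "x-ray leg with 20km distance"),
                stringsAsFactors = FALSE)
B <- data.frame(name = c("X-ray left leg arteries", "X-ray leg",
                         "xray right leg", "X-ray right leg arteries"),
                stringsAsFactors = FALSE)

# Hypothetical helper: strip punctuation, lower-case, split on spaces.
prep_words <- function(x) strsplit(tolower(gsub("[[:punct:]]", "", x)), " ")

# Hypothetical helper: for each phrase in `a`, find the phrase in `b`
# that contains the largest share of its words.
best_match <- function(a, b) {
  wa <- prep_words(a)
  wb <- prep_words(b)  # tokenise b only once
  res <- lapply(wa, function(w1) {
    pct <- vapply(wb, function(w2) sum(w1 %in% w2) / length(w1) * 100, numeric(1))
    i <- which.max(pct)  # first maximum wins on ties
    data.frame(match = b[i], percentage = pct[i], stringsAsFactors = FALSE)
  })
  cbind(data.frame(phrase = a, stringsAsFactors = FALSE), do.call(rbind, res))
}

best <- best_match(A$name, B$name)
best
```

The result has one row per phrase in A, with the columns phrase, match, and percentage. If you need all tied best matches rather than just the first, `which(pct == max(pct))` could replace `which.max(pct)`, at the cost of returning several rows per phrase.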