Tags: r, matching, stringr, grepl, fuzzy

How to use the percentage of matching characters to decide whether two strings match in R


Context

I have two vectors. fruits_Jack_eat is a character vector of length 1 that stores the fruits Jack ate as a single comma-separated string. fruits_list is a character vector of length 3 that stores different types of fruit.

Question

I want to find out whether Jack ate one or more of the fruits in fruits_list, but the comparison is not an exact one. fruits_list[1] is 'Navel orange', and one of the fruits Jack ate is 'XXXorange'. Although 'XXXorange' is not exactly the same as 'Navel orange', I still want to count it as a match.

Reproducible code

fruits_Jack_eat = 'XXXorange,PPPapple,QQQbanana'
  
fruits_list = c('Navel orange', 'Super big apple', 'Very yellow banana')

Expected output

Given fruits_Jack_eat and fruits_list, the result should be a data frame. The first column is a logical vector indicating whether any match was found. The second column is a character vector containing the parts of fruits_Jack_eat that are similar to entries in fruits_list. Something like this:

df_output = data.frame(matched = TRUE, matched_char = c('orange,apple,banana'))

> df_output
  matched        matched_char
1    TRUE orange,apple,banana

What I've done

  1. how to get percentage character match between two strings using sqldf in R
  2. Identify the percentage of string Match in R

Solution

  • Maybe this helps

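    library(stringr)
    library(tibble)
    # Assumes the unwanted prefixes are uppercase letters, as in the example.
    # Drop them and build a regex alternation: "orange|apple|banana"
    pattern <- str_replace_all(str_remove_all(fruits_Jack_eat, "[A-Z]+"), ",", "|")
    # str_extract() returns NA for list entries where no alternative matches
    matched_char <- str_extract(fruits_list, pattern)
    found <- matched_char[!is.na(matched_char)]
    tibble(matched = length(found) > 0,
           matched_char = str_c(found, collapse = ","))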
    # A tibble: 1 × 2
      matched matched_char       
      <lgl>   <chr>              
    1 TRUE    orange,apple,banana
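
If the prefixes are not always uppercase letters, a percentage-based comparison may be closer to what the question title asks for. The following is only a sketch in base R, not part of the answer above: it uses adist() with partial = TRUE to get the edit distance between each eaten fruit and its best-matching substring of each list entry, then converts that into a rough share of matching characters. The names eaten, similarity, hit and the 0.6 cut-off in threshold are assumptions made for illustration.

```r
# Data from the question
fruits_Jack_eat <- 'XXXorange,PPPapple,QQQbanana'
fruits_list <- c('Navel orange', 'Super big apple', 'Very yellow banana')

# Split the single comma-separated string into one element per fruit
eaten <- strsplit(fruits_Jack_eat, ",", fixed = TRUE)[[1]]

# Rows = eaten fruits, columns = fruits_list entries.
# partial = TRUE compares each eaten fruit with substrings of the list entry.
d <- adist(eaten, fruits_list, partial = TRUE)

# Rough "percentage of matching characters": 1 - distance / length of eaten item
# (the vector recycles down the columns, so row i is divided by nchar(eaten)[i])
similarity <- 1 - d / nchar(eaten)

threshold <- 0.6  # hypothetical cut-off, tune it for your own data
hit <- apply(similarity >= threshold, 1, any)

df_output <- data.frame(matched = any(hit),
                        matched_char = paste(eaten[hit], collapse = ","))
```

Note that this version reports the eaten items that pass the threshold (for example 'XXXorange') rather than the shared substring ('orange'), so it differs slightly from the expected output in the question.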