R: String Fuzzy Matching using jarowinkler


I have two character vectors in R.

I want to compare the reference list to the raw character list using jarowinkler and assign a % similarity score. So, for example, if I have 10 reference items and 20 raw data items, I want to get the best score for the comparison and what the algorithm matched it to (so 2 vectors of 10). If I have 8 raw data items and 10 reference items, I should end up with only 2 vectors of 8 items, holding the best match and score per item:

item, match, matched_to
ice, 78, ice-cream

Below is my code which isn't much to look at.

NumItems.Raw = length(words)
NumItems.Ref = length(Ref.Desc)

for (item in words) 
{
  for (refitem in Ref.Desc)
  {
    jarowinkler(refitem,item)

    # Find Best match Score
    # Find Best Item in reference table
    # Add both items to vectors
    # decrement NumItems.Raw
    # Loop
  }
} 

Solution

  • Using a toy example:

    library(RecordLinkage)
    library(dplyr)
    
    ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
    words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')
    
    wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
    wordlist %>%
      group_by(words) %>%
      mutate(match_score = jarowinkler(words, ref)) %>%
      summarise(match = match_score[which.max(match_score)],
                matched_to = ref[which.max(match_score)])
    

    gives

     words     match matched_to
    1   cat 1.0000000        cat
    2   cow 1.0000000        cow
    3   dog 1.0000000        dog
    4   emu 0.5277778       bear
    5 horse 1.0000000      horse
    6  kiwi 0.5350000      koala
    7   pig 1.0000000        pig
    8 sheep 1.0000000      sheep
    

    Edit: In response to the OP's comment: the last command uses the pipeline approach from dplyr. It groups every combination of raw word and reference by the raw word, adds a column match_score holding the jarowinkler score, and returns a summary with only the highest match score in each group (indexed by which.max(match_score)), together with the reference at that same index.
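
For readers who find the pipeline hard to follow, the same group, score, and pick-the-maximum logic can be sketched in base R. This is only an illustrative sketch, not part of the original answer: `best_match` and its `score_fun` argument are hypothetical names introduced here, and the scaling to a percentage is an assumption based on the "78" in the question's desired output. It also assumes the scoring function accepts two equal-length character vectors, as `jarowinkler` is called in the answer above.

```r
# Hypothetical helper: for each raw word, score it against every reference
# item and keep the highest-scoring reference.
best_match <- function(words, ref, score_fun) {
  rows <- lapply(words, function(w) {
    # Compare this one word with all reference items at once
    scores <- score_fun(rep(w, length(ref)), ref)
    # which.max returns the index of the first maximum, so ties go to the
    # reference item that appears first in `ref`
    best <- which.max(scores)
    data.frame(item = w,
               match = round(100 * scores[best]),  # assumed percent scale
               matched_to = ref[best],
               stringsAsFactors = FALSE)
  })
  # Stack the one-row results into a single data frame
  do.call(rbind, rows)
}

# Usage with the toy data from the answer (runs only if RecordLinkage is installed)
if (requireNamespace("RecordLinkage", quietly = TRUE)) {
  ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
  words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')
  print(best_match(words, ref, RecordLinkage::jarowinkler))
}
```

Unlike the dplyr version, this keeps the rows in the order of `words` rather than sorting them, and it returns one row per raw item regardless of how many reference items there are.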