r  dataframe  spell-checking  stringdist

Getting the closest string matches between two lists


I am a real beginner in R and I have two lists of city names. One list has user-generated names (people spell messily) and the other has the correct spellings.

I tried using the stringdist package and ended up with code that loops (with for) and gives the closest match. But I could only input vectors, and I really need to use data frames.

This is my code (oh boy, it feels awkward):

library(stringdist)

input <- "BAC"   # misspelled
correct <- c("ABC", "DEF", "GHI", "JKL") # list with all correct names
shortest <- -1

for (word in correct) {

  dist <- stringdist(input, word)
  #checks if it's a match!
  if (dist == 0){
    closest <- word
    shortest <- 0

    break

  }

  if(dist <= shortest || shortest < 0){
    closest <- word
    shortest <- dist

  }

}


if(shortest == 0){ 
  print("It's a match!")
} else {
  print(closest)
}

The idea is to use this code as a starting point: I want to go from this to running stringdist on each row of my data frame. I don't even know if this is a good idea or if it would take too much processing power, so don't be afraid to say it's stupid. Thanks!


Solution

  • There is a dedicated function for this in the stringdist package, called amatch:

    library(stringdist)

    input <- "BAC"   # misspelled
    correct <- c("ABC", "DEF", "GHI", "JKL") 
    
    correct[amatch(input, correct, maxDist = Inf)]
    # "ABC"
    

    This also works for multiple input words at once, so there is no need for a for loop:

    input <- c("New Yorkk", "Berlyn", "Pariz") # misspelled 
    correct <- c("Berlin", "Paris", "New York", "Los Angeles") # correct names
    
    correct_words <- correct[amatch(input, correct, maxDist = Inf)]
    data.frame(input, correct_words)
    
     #       input correct_words
     #   New Yorkk      New York
     #      Berlyn        Berlin
     #       Pariz         Paris
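
Since the question asks about data frames specifically, here is a minimal sketch of the same idea applied to data frame columns. The data frames `users` and `reference` and their column names are made up for illustration, and the finite `maxDist = 2` is just an assumed threshold so that hopeless inputs come back as NA instead of being forced onto the nearest city. Tune it to your data.

```r
library(stringdist)

# Hypothetical data: messy user input and a reference table of correct names
users <- data.frame(city_raw = c("New Yorkk", "Berlyn", "Pariz", "Xyzzy"),
                    stringsAsFactors = FALSE)
reference <- data.frame(city = c("Berlin", "Paris", "New York", "Los Angeles"),
                        stringsAsFactors = FALSE)

# amatch() returns, for each input, the index of the closest entry in the
# lookup table, or NA when nothing lies within maxDist (assumed threshold: 2)
idx <- amatch(users$city_raw, reference$city, maxDist = 2)

# Add the matched name and its distance as new columns
users$city_matched <- reference$city[idx]
users$distance <- stringdist(users$city_raw, users$city_matched)

print(users)
```

Rows where `city_matched` is NA are the ones worth checking by hand. Both calls are vectorised over the whole column, so no explicit loop over rows is needed.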