I am a real beginner in R, and I have two lists of city names. One list has user-generated names (people spell messily), and the other has the correct spellings.
I tried the stringdist package and ended up with code that loops (with for) and returns the closest match. But I could only pass in vectors, and I really need to use data frames.
This is my code (oh boy, it feels awkward):
library(stringdist)

input <- "BAC" # misspelled
correct <- c("ABC", "DEF", "GHI", "JKL") # list with all correct names
shortest <- -1 # -1 means no distance has been computed yet
for (word in correct) {
  dist <- stringdist(input, word)
  # checks if it's a match!
  if (dist == 0) {
    closest <- word
    shortest <- 0
    break
  }
  # keep the candidate with the smallest distance seen so far
  if (dist <= shortest || shortest < 0) {
    closest <- word
    shortest <- dist
  }
}
if (shortest == 0) {
  print("It's a match!")
} else {
  print(closest)
}
The idea is to use this code as a starting point: I want to go from this to applying stringdist to each row of my data frame. I don't even know if this is a good idea or if it would take too much processing power, so don't be afraid to say it's stupid. Thanks!
There is a special function for that in the stringdist package, called amatch:
library(stringdist)

input <- "BAC" # misspelled
correct <- c("ABC", "DEF", "GHI", "JKL")
correct[amatch(input, correct, maxDist = Inf)]
# "ABC"
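With maxDist = Inf, every input gets matched to something, even if nothing in the list is really close. As a minimal sketch, assuming you would rather get NA when no candidate is close enough, you can pass a finite maxDist. The cutoff of 2 and the input "XYZQW" are made up for illustration, so tune the cutoff to your own data:

```r
library(stringdist)

correct <- c("ABC", "DEF", "GHI", "JKL")
input <- c("BAC", "XYZQW") # the second value is a made-up input with no close match

# maxDist = 2 is an arbitrary cutoff chosen for illustration;
# inputs farther than that from every candidate get NA instead of a match
idx <- amatch(input, correct, maxDist = 2)
result <- correct[idx]
result
```

Rows that come back as NA can then be checked by hand instead of being silently assigned a wrong city.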
amatch also works for multiple input words at once, so there is no need for a for loop:
input <- c("New Yorkk", "Berlyn", "Pariz") # misspelled
correct <- c("Berlin", "Paris", "New York", "Los Angeles") # correct names
correct_words <- correct[amatch(input, correct, maxDist = Inf)]
data.frame(input, correct_words)
#     input correct_words
# New Yorkk      New York
#    Berlyn        Berlin
#     Pariz         Paris
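Since the question asks about data frames, here is a rough sketch of the same idea applied to data frame columns. The data frames users and reference and the column names city, city_corrected and distance are hypothetical, so swap in your own. A data frame column is just a vector, which means amatch can take it directly:

```r
library(stringdist)

# Hypothetical data frames; the names are assumptions for illustration only
users <- data.frame(city = c("New Yorkk", "Berlyn", "Pariz"),
                    stringsAsFactors = FALSE)
reference <- data.frame(city = c("Berlin", "Paris", "New York", "Los Angeles"),
                        stringsAsFactors = FALSE)

# amatch returns, for each user entry, the row index of the closest reference name
idx <- amatch(users$city, reference$city, maxDist = Inf)
users$city_corrected <- reference$city[idx]

# Keep the distance next to each match so that doubtful matches are easy to spot
users$distance <- stringdist(users$city, users$city_corrected)
users
```

Both calls are vectorized over the whole column, so there is no explicit loop over rows.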