How to create a function that returns the most similar word from a list of words, even if the word is not exactly the same?
The function should have two inputs: one for the word and the other for the list. The function should return the word that is most similar to the word.
lst = ['apple','app','banana store','pear','beer']
func('apple inc.',lst)
>>'apple'
func('banana',lst)
>>'banana store'
From doing some research, it seems that I have to use the concepts of Fuzzy String Matching, NLTK, and Levenshtein-distance, which I'm having a hard time trying to implement in creating a function like this.
I should also point out that by similar, I just mean the characters and I'm not concerned for the meaning of the word at all.
Slow solution for debugging:
def func(word, lst):
items = sorted((dist(word, w), w) for w in lst)
# Print items here for debugging.
if not items:
raise ValueError('List of words is empty.')
return items[0][1]
Or, this is faster and uses less memory:
def func(word, lst):
return min((dist(word, w), w) for w in lst)[1]
See https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for implementing dist
. One of the answers has a link to a Levenshtein-distance implementation.