Search code examples
pythonregexlistnltklevenshtein-distance

How to return the most similar word from a list of words?


How to create a function that returns the most similar word from a list of words, even if the word is not exactly the same?

The function should have two inputs: one for the word and the other for the list. The function should return the word that is most similar to the word.

lst = ['apple','app','banana store','pear','beer']

func('apple inc.',lst)
>>'apple'
func('banana',lst)
>>'banana store'

From doing some research, it seems that I have to use the concepts of Fuzzy String Matching, NLTK, and Levenshtein-distance, which I'm having a hard time trying to implement in creating a function like this.

I should also point out that by similar, I just mean the characters and I'm not concerned for the meaning of the word at all.


Solution

  • Slow solution for debugging:

    def func(word, lst):
      items = sorted((dist(word, w), w) for w in lst)
      # Print items here for debugging.
      if not items: 
        raise ValueError('List of words is empty.')
      return items[0][1]
    

    Or, this is faster and uses less memory:

    def func(word, lst):
      return min((dist(word, w), w) for w in lst)[1]
    

    See https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for implementing dist. One of the answers has a link to a Levenshtein-distance implementation.