python regex list nltk levenshtein-distance

How to return the most similar word from a list of words?

How to create a function that returns the most similar word from a list of words, even if the word is not exactly the same?

The function should have two inputs: one for the word and the other for the list. The function should return the word that is most similar to the word.

lst = ['apple','app','banana store','pear','beer']

func('apple inc.',lst)
>>'apple'
func('banana',lst)
>>'banana store'

From doing some research, it seems that I have to use the concepts of Fuzzy String Matching, NLTK, and Levenshtein-distance, which I'm having a hard time trying to implement in creating a function like this.

I should also point out that by similar, I just mean the characters and I'm not concerned for the meaning of the word at all.

Solution

Slow solution for debugging:

def func(word, lst):
  items = sorted((dist(word, w), w) for w in lst)
  # Print items here for debugging.
  if not items: 
    raise ValueError('List of words is empty.')
  return items[0][1]

Or, this is faster and uses less memory:

def func(word, lst):
  return min((dist(word, w), w) for w in lst)[1]

See https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for implementing dist. One of the answers has a link to a Levenshtein-distance implementation.