I have a problem where I need to do a fuzzy lookup in a hash map, i.e. return the value corresponding to that key that most closely resembles the query, in my case measured by the Levenshtein distance.
My current approach is to subclass dict
with a special lookup method that computes the Levenshtein distance against all keys, then returns the value of the key with the lowest score. Basically this:
import Levenshtein
class FuzzyLookupDict(dict):
def fuzzy_lookup(self, query):
levs = [(key, Levenshtein.ratio(query, key)) for key in self.keys()]
key, score = max(levs, key=lambda lev: lev[1])
return self.get(key)
Is this a good approach or is there a better solution that I haven't thought of?
This problem is normally solved with Levenshtein automata. A Levenshtein automaton for a string w and a number n is a finite state automaton that can recognize the set of all strings whose Levenshtein distance from w is at most n.
This algorithm is significantly faster than using dynamic programming to compute the Levenshtein distance separately for each dictionary word.
Jule Jacob's blog post Levenshtein automata can be simple and fast is a good starting point and Nick Johnsonz's Damn Cool Algorithms: Levenshtein Automata is a more in depth intro.
You can find some Python implementations on Github, for example https://github.com/antoinewdg/pyffs.