python, dynamic-programming, levenshtein-distance, edit-distance

How to use LRU cache for non-hashable lists?


The Levenshtein distance between the characters of two strings can be computed with a recursive implementation memoized by lru_cache:

from functools import lru_cache

@lru_cache(maxsize=4095)
def ld(s, t):
    """
    Levenshtein distance memoized implementation from Rosetta code:
    https://rosettacode.org/wiki/Levenshtein_distance#Python
    """
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    l1 = ld(s, t[1:])      # Deletion.
    l2 = ld(s[1:], t)      # Insertion.
    l3 = ld(s[1:], t[1:])  # Substitution.
    return 1 + min(l1, l2, l3)

E.g.

>>> ld("hello how are the you ?", "how Halo how you are the ?")
13

However, if I want to compute the distance between two lists of words instead of the characters in strings, it works without the lru_cache:

from functools import lru_cache

#@lru_cache(maxsize=4095)
def ld(s, t):
    """
    Levenshtein distance memoized implementation from Rosetta code:
    https://rosettacode.org/wiki/Levenshtein_distance#Python
    """
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    l1 = ld(s, t[1:])      # Deletion.
    l2 = ld(s[1:], t)      # Insertion.
    l3 = ld(s[1:], t[1:])  # Substitution.
    return 1 + min(l1, l2, l3)

Usage:

>>> ld("hello how are the you ?".split(), "how Halo how you are the ?".split())
4

However, without the LRU cache, the naive recursion recomputes the same subproblems exponentially many times, so it won't scale to longer lists of words.

And with the LRU cache, it throws an unhashable type error, because lists cannot be used as cache keys:

>>> ld("hello how are the you ?".split(), "how Halo how you are the ?".split())

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-1de95d00ebae> in <module>
----> 1 ld("hello how are the you ?".split(), "how Halo how you are the ?".split())

TypeError: unhashable type: 'list'

My questions are:

  • Is it possible to make the lists of strings hashable so that the lru_cache can be used?
  • If it's not possible to somehow hash the list of words, how should I go about computing the distance efficiently to find the distance between two lists of words? Only through dynamic programming?

Solution

  • Since you are not modifying the lists, but only slicing them, you may pass tuples instead, which are hashable. Slicing a tuple yields another tuple, so every recursive call stays hashable.

    s = "hello how are the you ?".split()
    t = "how Halo how you are the ?".split()
    ld(tuple(s), tuple(t))
    
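If you would rather keep calling the function with plain lists, one option is a small decorator that converts the arguments to tuples before they reach the cached function. This is only a sketch; the name `tuplize` is my own, not part of functools:

```python
from functools import lru_cache, wraps

def tuplize(func):
    """Convert both arguments to tuples so lru_cache can hash them.

    tuple() on something that is already a tuple is a cheap no-op,
    so the recursive calls (which pass tuple slices) are unaffected.
    """
    @wraps(func)
    def wrapper(s, t):
        return func(tuple(s), tuple(t))
    return wrapper

@tuplize
@lru_cache(maxsize=4095)
def ld(s, t):
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    l1 = ld(s, t[1:])      # Deletion.
    l2 = ld(s[1:], t)      # Insertion.
    l3 = ld(s[1:], t[1:])  # Substitution.
    return 1 + min(l1, l2, l3)
```

Now both `ld(list1, list2)` and `ld(tuple1, tuple2)` work, and the cache is shared between them.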

    Otherwise, you may avoid lru_cache-decorated functions altogether by using loops with extra space, where you memoize the subproblem results in a table (bottom-up dynamic programming).
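A minimal sketch of that bottom-up approach (the classic Wagner-Fischer algorithm, keeping only two rows of the table) could look like this; it never hashes the elements, only compares them, so it works on strings and word lists alike:

```python
def ld_iter(s, t):
    """Iterative Levenshtein distance using O(len(t)) extra space.

    prev[j] holds the distance between the current prefix of s
    and the first j elements of t.
    """
    prev = list(range(len(t) + 1))  # Distances from the empty prefix of s.
    for i, s_elem in enumerate(s, 1):
        curr = [i]  # Deleting all i elements of the s prefix.
        for j, t_elem in enumerate(t, 1):
            cost = 0 if s_elem == t_elem else 1
            curr.append(min(prev[j] + 1,          # Insertion.
                            curr[j - 1] + 1,      # Deletion.
                            prev[j - 1] + cost))  # Substitution / match.
        prev = curr
    return prev[-1]
```

Usage is the same as before, but no tuple conversion is needed:

```python
>>> ld_iter("hello how are the you ?".split(), "how Halo how you are the ?".split())
4
```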