Search code examples
pythonlevenshtein-distance

Levenstein distance substring


Is there a good way to use levenstein distance to match one particular string to any region within a second longer string?

Example:

str1='aaaaa'
str2='bbbbbbaabaabbbb'

if str1 in str2 with a distance < 2:
    return True

So in the above example part of string 2 is aabaa and distance(str1,str2) < 2 so the statement should return True.

The only way I can think to do this is take 5 chars from str2 at a time, compare that with str1 and then repeat this moving through str2. Unfortunately this seems really inefficient and I need to process a large amount of data this way.


Solution

  • The trick is to generate all the substrings of appropriate length of b, then compare each one.

    def lev_dist(a,b):
        length_cost = abs(len(a) - len(b))
        diff_cost = sum(1 for (aa, bb) in zip(a,b) if aa != bb)
        return diff_cost + length_cost
    
    def all_substr_of_length(n, s):
        if n > len(s):
            return [s]
        else:
            return [s[i:i+n] for i in range(0, len(s)-n+1)]
    
    def lev_substr(a, b):
        """Gives minimum lev distance of all substrings of b and
        the single string a.
        """
    
        return min(lev_dist(a, bb) for bb in all_substr_of_length(len(a), b))
    
    if lev_substr(str1, str2) < 2:
        # it works!