Is there a good way to use levenstein distance to match one particular string to any region within a second longer string?
Example:
str1='aaaaa'
str2='bbbbbbaabaabbbb'
if str1 in str2 with a distance < 2:
return True
So in the above example part of string 2 is aabaa
and distance(str1,str2) < 2
so the statement should return True
.
The only way I can think to do this is take 5 chars from str2 at a time, compare that with str1 and then repeat this moving through str2. Unfortunately this seems really inefficient and I need to process a large amount of data this way.
The trick is to generate all the substrings of appropriate length of b
, then compare each one.
def lev_dist(a,b):
length_cost = abs(len(a) - len(b))
diff_cost = sum(1 for (aa, bb) in zip(a,b) if aa != bb)
return diff_cost + length_cost
def all_substr_of_length(n, s):
if n > len(s):
return [s]
else:
return [s[i:i+n] for i in range(0, len(s)-n+1)]
def lev_substr(a, b):
"""Gives minimum lev distance of all substrings of b and
the single string a.
"""
return min(lev_dist(a, bb) for bb in all_substr_of_length(len(a), b))
if lev_substr(str1, str2) < 2:
# it works!