Search code examples
pythonlevenshtein-distance

Levenshtein edit distance Python


This piece of code returns the Levenshtein edit distance of 2 terms. How can i make this so that insertion and deletion only costs 0.5 instead of 1 ? substitution should still costs 1.

def substCost(x,y):
   if x == y: 
      return 0
   else: 
      return 1

def  levenshtein(target, source):
   i = len(target); j = len(source)
   if i == 0:  
      return j
   elif j == 0: 
      return i

   return(min(levenshtein(target[:i-1],source)+1,
          levenshtein(target, source[:j-1])+1,
          levenshtein(target[:i-1], source[:j-1])+substCost(source[j-1],target[i-1])))

Solution

  • There are two places you need to account for the reduced cost of adding or removing a vowel. They are the return j and return i lines in the base cases of your function, and the +1s in the min call after the first two recursive calls.

    We need to change each of those to use a "ternary" expression: 0.5 if ch in 'aeiou' else 1 in place of assuming a cost of 1 per character added or removed.

    For the base cases, we can replace the return values with sum calls on a generator expression that includes the ternary expression:

    if i == 0:  
        return sum(0.5 if ch in 'aeiou' else 1 for ch in source)
    elif j == 0: 
        return sum(0.5 if ch in 'aeiou' else 1 for ch in target)
    

    For later cases, we can replace the +1 with the ternary expression itself (with an index rather than the ch iteration variable):

    return min(levenshtein(target[:i-1],source) + (0.5 if target[-1] in 'aeiou' else 1),
               levenshtein(target, source[:j-1]) + (0.5 if source[-1] in 'aeiou' else 1),
               levenshtein(target[:i-1], source[:j-1])+substCost(source[j-1],target[i-1]))
    

    If you wanted to generalize this, you could move the ternary expression out into its own function, named something like addCost, and call it from the code in the levenshtein function.