Search code examples
pythonlevenshtein-distance

python-Levenshtein ratio calculation


I have the following two strings:

a = 'bjork gudmundsdottir'
b = 'b. gudmundsson gunnar'

The Levenshtein distance between the two is 12. When I use the following formula for Levenshtein distance, I get a discrepancy of 0.01 with the python-Levenshtein library:

>>> Ldist / max(len( a ), len( b ))
>>> float(12)/21
0.5714285714285714
# python-Levenshtein
Levenshtein.ratio(a,b)
0.5853658536585366
# difflib
>>> seq=difflib.SequenceMatcher(a=a,b=b)
>>> seq.ratio()
0.5853658536585366

What accounts for this difference? What am I doing incorrectly in my calculation. Note that I have reviewed this How python-Levenshtein.ratio is computed similar question and it does not quite answer what I am asking.

Could someone please explain the formula that is used to calculate the above ratio?


Solution

  • From Lukas's comment, the reason for this is that the ratio() uses a cost of 2 for replace operations rather than the normal cost of 1 for the Levenshtein distance. Here's an example calculation:

    a = 'bjork gudmundsdottir'
    b = 'b. gudmundsson gunnar'
    
    >>> Levenshtein.editops(a,b)
    [('delete', 1, 1), ('delete', 2, 1), ('delete', 3, 1), ('replace', 4, 1), ('replace', 14, 11), ('insert', 16, 13), ('insert', 16, 14), ('insert', 16, 15), ('insert', 16, 16), ('replace', 16, 17), ('replace', 17, 18), ('replace', 18, 19)]
    
    >>> ldist = sum([2 for item in Levenshtein.editops(a,b) if item[0] == 'replace']) 
              + sum([1 for item in Levenshtein.editops(a,b) if item[0] != 'replace']) # 17
    ln = len(a) + len(b) # 41
    
    >>> (41.0-17.0)/41.0
    0.5853658536585366
    >>> Levenshtein.ratio(a,b)
    0.5853658536585366