python string unicode levenshtein-distance edit-distance

How is Levenshtein Distance calculated on Simplified Chinese characters?


I have 2 queries:

    query1:你好世界
    query2:你好

When I run this code using the Python library Levenshtein:

    from Levenshtein import distance, hamming, median
    lev_edit_dist = distance(query1,query2)
    print lev_edit_dist

I get an output of 12. How is the value 12 derived?

In terms of stroke differences, there are definitely more than 12.


Solution

  • According to its documentation, it supports Unicode:

    It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

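    You need to make sure the Chinese characters are Unicode strings, though. In Python 2, a plain string literal is a byte string, and UTF-8 encodes each of these characters as 3 bytes, so the distance is counted in bytes rather than in characters: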

    In [1]: from Levenshtein import distance, hamming, median
    
    In [2]: query1 = '你好世界'
    
    In [3]: query2 = '你好'
    
    In [4]: print distance(query1,query2)
    6
    
    In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
    2
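
To see where the two numbers come from, here is a minimal sketch of the textbook dynamic-programming edit distance, run once on characters and once on UTF-8 bytes. It is written for Python 3, where `str` is already Unicode, and the helper name `edit_distance` is hypothetical: it only illustrates the idea and is not the Levenshtein library's implementation.

```python
def edit_distance(a, b):
    """Textbook dynamic-programming Levenshtein distance.

    Compares the two sequences element by element: code points for
    str, single bytes for bytes.
    """
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


query1 = "你好世界"
query2 = "你好"

# Unicode strings: the two characters 世 and 界 are deleted.
chars = edit_distance(query1, query2)

# UTF-8 byte strings: each of these characters is 3 bytes,
# so deleting the same two characters costs 6 byte deletions.
raw = edit_distance(query1.encode("utf-8"), query2.encode("utf-8"))

print(chars, raw)  # prints: 2 6
```

Neither count has anything to do with strokes: the distance only counts insertions, deletions and substitutions of whole sequence elements, which are characters for Unicode strings and bytes for byte strings. In UTF-8, query1 is 12 bytes long and query2 is 6 bytes long.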