python string unicode levenshtein-distance edit-distance

How is Levenshtein Distance calculated on Simplified Chinese characters?

I have 2 queries:

    query1:你好世界
    query2:你好

When i run this code using the python library Levenshtein:

from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist

I get an output of 12. Now the question is how is the value 12 derived?

Because in terms of strokes difference, theres definitely more than 12.

Solution

According to its documentation, it supports unicode:

It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).

You need to make sure the Chinese characters are in unicode though:

In [1]: from Levenshtein import distance, hamming, median

In [2]: query1 = '你好世界'

In [3]: query2 = '你好'

In [4]: print distance(query1,query2)
6

In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2