I have 2 queries:
query1:你好世界
query2:你好
When I run this code using the Python library Levenshtein:
from Levenshtein import distance, hamming, median
lev_edit_dist = distance(query1,query2)
print lev_edit_dist
I get an output of 12. How is the value 12 derived?
In terms of stroke differences, there are definitely more than 12.
According to its documentation, it supports Unicode:
It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses).
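You need to make sure the Chinese characters are Unicode strings, though. With plain (byte) strings the distance is counted in bytes of the encoded text rather than in characters, and each of these characters takes 3 bytes in UTF-8: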
In [1]: from Levenshtein import distance, hamming, median
In [2]: query1 = '你好世界'
In [3]: query2 = '你好'
In [4]: print distance(query1,query2)
6
In [5]: print distance(query1.decode('utf8'),query2.decode('utf8'))
2
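To see where the two numbers come from without relying on the library, here is a minimal sketch of a textbook dynamic-programming edit distance. The function name `edit_distance` is made up for this illustration and is not part of the Levenshtein package, and it is not necessarily how the library computes its result. It assumes Python 3 (the `u''` literals are only there to mirror the session above) and UTF-8 as the encoding.

```python
# -*- coding: utf-8 -*-

def edit_distance(a, b):
    """Textbook Levenshtein distance over two sequences (hypothetical helper)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

query1 = u'你好世界'
query2 = u'你好'

# Compared as Unicode strings: two characters have to be deleted.
chars = edit_distance(query1, query2)

# Compared as UTF-8 bytes: the same two characters are 3 bytes each.
raw_bytes = edit_distance(query1.encode('utf8'), query2.encode('utf8'))

print(chars)      # 2
print(raw_bytes)  # 6
```

So the 6 in the session above is the number of bytes that differ, and the 2 is the number of characters. Neither has anything to do with strokes; Levenshtein distance only counts insertions, deletions and substitutions of whole sequence elements.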