Consider this example:
>> from fuzzywuzzy import process
>> choices = ['account', 'update', 'query']
>> process.extract('u', choices)
[('account', 90), ('update', 90), ('query', 90)]
In the above case, it's confusing for my end-users that account is ranked above update for the given string. In this case account happens to be arbitrarily placed in front due to list order, since all matches share the same score. However, I would have imagined that update would have a higher score simply because the character u shows up earlier in the string.
Is this a conceptual error or am I not using the correct scorer here?
First of all you are using a "bad" scorer. Based on your scores you are probably using difflib. You should switch to the python-Levenshtein
based implementation. This can be done with the scorer
parameter.
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
def MyScorer(s1, s2):
fratio = fuzz.ratio(s1, s2)
fratio -= (s2.find(s1)*5)
return fratio
choices = ['account', 'update', 'query']
dex = process.extract('u', choices, scorer=fuzz.token_sort_ratio)
mex = process.extract('u', choices, scorer=MyScorer)
print("Default Scorer:", dex)
print("MyScorer:", mex)
Now the output is
[('query', 33), ('update', 29), ('account', 25)]
Which is better, but Levenshtein
didn't really care about the position,
Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.
that's why i added the definition for MyScorer()
where you can implement your own algorithm which takes the position into account. I also added a example of an implementation which takes the position into account (but i'm not really experienced at designing such algorithms, so don't expect this one to be perfect or even usable). Anyway, the output of MyScorer is:
[('update', 29), ('query', 28), ('account', 5)]