Search code examples
pythonfuzzy-searchfuzzywuzzy

Why does fuzzywuzzy not consider character order


Consider this example:

>> from fuzzywuzzy import process
>> choices = ['account', 'update', 'query']
>> process.extract('u', choices)
[('account', 90), ('update', 90), ('query', 90)]

In the above case, it's confusing for my end-users that account is ranked above update for the given string. In this case account happens to be arbitrarily placed in front due to list order, since all matches share the same score. However, I would have imagined that update would have a higher score simply because the character u shows up earlier in the string.

Is this a conceptual error or am I not using the correct scorer here?


Solution

  • First of all you are using a "bad" scorer. Based on your scores you are probably using difflib. You should switch to the python-Levenshtein based implementation. This can be done with the scorer parameter.

    from fuzzywuzzy import process
    from fuzzywuzzy import fuzz
    
    def MyScorer(s1, s2):
        fratio = fuzz.ratio(s1, s2)
        fratio -= (s2.find(s1)*5)
        return fratio
    
    choices = ['account', 'update', 'query']
    dex = process.extract('u', choices, scorer=fuzz.token_sort_ratio)
    mex = process.extract('u', choices, scorer=MyScorer)
    print("Default Scorer:", dex)
    print("MyScorer:", mex)
    

    Now the output is

    [('query', 33), ('update', 29), ('account', 25)]

    Which is better, but Levenshtein didn't really care about the position,

    Levenshtein distance (LD) is a measure of the similarity between two strings, which we will refer to as the source string (s) and the target string (t). The distance is the number of deletions, insertions, or substitutions required to transform s into t.

    that's why i added the definition for MyScorer() where you can implement your own algorithm which takes the position into account. I also added a example of an implementation which takes the position into account (but i'm not really experienced at designing such algorithms, so don't expect this one to be perfect or even usable). Anyway, the output of MyScorer is:

    [('update', 29), ('query', 28), ('account', 5)]