python, string-matching, fuzzywuzzy

Python's fuzzywuzzy returns unpredictable results


I'm working with fuzzywuzzy in Python, and while it claims to work with a Levenshtein distance, I find that pairs of strings that differ by a single character produce different scores. For example:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("vendedor","vendedora")
94
>>> fuzz.ratio("estagiário","estagiária")
90
>>> fuzz.ratio("abcdefghijlmnopqrst","abcdefghijlmnopqrsty")
97
>>> fuzz.ratio("abc","abcd")
86
>>> fuzz.ratio("a","ab")
67

I expected the Levenshtein distance to be the same in all of these examples, since each pair differs by a single character. I understand, though, that the result is not a plain distance but some sort of "equality percentage".

I tried to understand how it works, but I cannot. My very long string gives a 97 and the very short one a 67, which suggests that the longer the string, the smaller the impact of a single character. However, that does not hold for the "vendedor"/"vendedora" and "estagiário"/"estagiária" examples: the latter pair is longer than the former, yet it scores lower.

How does this work?

I am currently matching user-entered job titles, trying to connect mistyped names with correctly typed ones. Is there a better package for my task?


Solution

  • You are correct about how fuzzywuzzy works in general. A larger output number from the fuzz.ratio function means that the strings are closer to one another (with 100 being a perfect match). I performed a few additional test cases to check how it works. Here they are:

    fuzz.ratio("abc", "abce")  # shows that it doesn't matter which extra letter is added
    86
    fuzz.ratio("abcd", "abce")  # shows that replacing a letter is worse than adding one
    75
    fuzz.ratio("abc", "abc")  # shows what an exact match gives
    100
    

    From these tests, we can see that replacing a letter has a larger effect on the ratio calculation than adding one: a replacement leaves an unmatched character in both strings, while an addition leaves one in only one of them. This is why estagiário/estagiária was less of a match than vendedor/vendedora, despite being longer. The package can also be used to automatically select the best choice from a list of possible matches, so I think it would be a good choice for your intended purpose.
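
To make the arithmetic behind these numbers concrete, here is a minimal sketch that reproduces them with the standard library only. It assumes the score is 2*M/T, where M is the number of matching characters and T is the combined length of both strings; that is how `difflib.SequenceMatcher.ratio()` is defined, and it agrees with every score quoted above, but it is not taken from the fuzzywuzzy source. The names `match_ratio` and `best_match` are hypothetical helpers written for this illustration and are not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def match_ratio(a: str, b: str) -> int:
    """Hypothetical helper: 2*M/T as a rounded percentage (assumed formula)."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are treated as identical
    matcher = SequenceMatcher(None, a, b)
    # M = total size of all matching blocks found by difflib
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return round(100 * 2 * matches / total)


def best_match(query: str, choices: list[str]) -> tuple[str, int]:
    """Hypothetical helper: return the choice with the highest score."""
    scored = [(choice, match_ratio(query, choice)) for choice in choices]
    return max(scored, key=lambda pair: pair[1])


# Adding one letter: M = 8, T = 8 + 9 = 17, so 2*8/17 is about 0.94
print(match_ratio("vendedor", "vendedora"))  # 94
# Replacing one letter: M = 3, T = 4 + 4 = 8, so 2*3/8 = 0.75
print(match_ratio("abcd", "abce"))  # 75
# Picking the closest correctly typed title for a mistyped one
print(best_match("vendedro", ["vendedor", "gerente", "motorista"])[0])  # vendedor
```

A replacement lowers M without changing T, while an addition only raises T by one, which is why the replacement cases score lower at similar lengths. For real use, the package's own list-matching functions are probably a better fit than the `best_match` sketch here.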