I'm working with fuzzywuzzy in Python, and while it claims to work with Levenshtein distance, I find that many pairs of strings that differ by a single character produce different results. For example:
>>> fuzz.ratio("vendedor", "vendedora")
94
>>> fuzz.ratio("estagiário", "estagiária")
90
>>> fuzz.ratio("abcdefghijlmnopqrst", "abcdefghijlmnopqrsty")
97
>>> fuzz.ratio("abc", "abcd")
86
>>> fuzz.ratio("a", "ab")
67
I would expect the Levenshtein distance to be the same in all of these examples, since each pair differs by a single character, but I understand that this is not a plain distance: it is some sort of "equality percentage".
I tried to work out how it is calculated, but I cannot figure it out. My very long string gives a 97 and the very short one a 67, so I guess that the longer the string, the less impact a single character has. However, that is not the case for the "vendedor"/"vendedora" and "estagiário"/"estagiária" examples, as the latter pair is longer than the former yet scores lower.
How does this work?
I am currently matching user-entered job titles, trying to connect mistyped names with correctly typed ones. Is there a better package for my task?
You are correct about how fuzzywuzzy works in general. A larger output number from the fuzz.ratio function means that the strings are closer to one another (with 100 being a perfect match). I performed a couple of additional test cases to check how it works. Here they are:
fuzz.ratio("abc", "abce") # to show that which extra letter is added doesn't matter.
86
fuzz.ratio("abcd", "abce") # to show that replacing a letter is worse than adding one.
75
fuzz.ratio("abc", "abc") # to find what a perfect match gives.
100
From these tests, we can see that replacing a letter has a larger effect on the ratio calculation than adding one (this is why estagiário/estagiária was less of a match than vendedor/vendedora, despite being longer). A replacement leaves one unmatched character in each string, while an addition leaves only one unmatched character in total. According to this, the package can also be used to automatically select the best choice from a list of possible matches, and thus I think it would be a good choice for your intended purpose.
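To make the arithmetic concrete, here is a minimal sketch that reproduces the numbers above using only the standard library. It assumes fuzz.ratio behaves like round(100 * 2*M / T), where M is the number of matched characters and T is the combined length of both strings, which is what difflib's SequenceMatcher.ratio() computes. The names ratio_sketch and best_match are made up for this illustration and are not part of fuzzywuzzy; the package has its own list matcher, process.extractOne, for the second job.

```python
from difflib import SequenceMatcher


def ratio_sketch(a, b):
    """Approximate fuzz.ratio as round(100 * 2*M / T) (assumed formula)."""
    # M = matched characters, T = len(a) + len(b).
    # SequenceMatcher.ratio() returns 2*M / T as a float in [0, 1].
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def best_match(query, choices):
    """Hypothetical helper: return (choice, score) for the closest choice."""
    scored = ((choice, ratio_sketch(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(ratio_sketch("vendedor", "vendedora"))      # 2*8/17 -> 94
    print(ratio_sketch("estagiário", "estagiária"))   # 2*9/20 -> 90
    print(ratio_sketch("abc", "abcd"))                # 2*3/7  -> 86
    print(ratio_sketch("abcd", "abce"))               # 2*3/8  -> 75
    print(ratio_sketch("a", "ab"))                    # 2*1/3  -> 67

    # Matching a mistyped job title against a list of known titles.
    titles = ["vendedor", "estagiário", "gerente"]
    print(best_match("vendedro", titles)[0])          # vendedor
```

Under this formula an added letter only increases T by one, while a replaced letter lowers M by one and leaves T unchanged, so the replacement costs more. That is also why longer strings are only "more forgiving" when the kind of edit is the same.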