fuzzywuzzy WRatio for uppercase detection


I need help figuring out why

fuzz.WRatio('Māne', 'mane', force_ascii=True) => 75%

and also

fuzz.WRatio('Māne', 'Mane', force_ascii=True) => 75%

I would expect the force_ascii parameter to make the comparison more accurate. Thank you.


Solution

  • fuzz.WRatio in fuzzywuzzy takes two arguments, force_ascii and full_process, which both default to True. Both control how the strings are preprocessed (force_ascii only has an effect when full_process is also True; otherwise it is ignored).

    1) With force_ascii=False, full_process=False: the strings are not changed before matching, so e.g. uppercase/lowercase matters.

    2) With force_ascii=False, full_process=True: all non-alphanumeric characters in the strings are replaced with whitespace, the strings are lowercased, and leading and trailing whitespace is trimmed. For example: "Mäne!" -> "Mäne " -> "mäne " -> "mäne"

    3) With force_ascii=True, full_process=True: this does the same as 2) but removes all non-ASCII characters beforehand. For example: "Mäne!" -> "Mne!" -> "Mne " -> "mne " -> "mne"
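The three cases above can be sketched in plain Python. This is only an illustration of the steps as described here, not fuzzywuzzy's own code: the helper name preprocess is hypothetical, and it assumes a strict reading of "non-ASCII" (every character above code point 127 is dropped).

```python
def preprocess(s, force_ascii=True, full_process=True):
    """Hypothetical helper mirroring the preprocessing steps described above."""
    if not full_process:
        # Case 1: the string is left untouched (force_ascii is ignored).
        return s
    if force_ascii:
        # Case 3 only: drop every non-ASCII character first (assumption: strict ASCII).
        s = s.encode("ascii", "ignore").decode("ascii")
    # Cases 2 and 3: replace non-alphanumeric characters with a space,
    # then lowercase and trim leading/trailing whitespace.
    s = "".join(c if c.isalnum() else " " for c in s)
    return s.lower().strip()


print(preprocess("Mäne!", force_ascii=False, full_process=False))  # Mäne!
print(preprocess("Mäne!", force_ascii=False, full_process=True))   # mäne
print(preprocess("Mäne!", force_ascii=True, full_process=True))    # mne
```

Because this sketch strips all non-ASCII characters, it would also drop "ā", which is not necessarily what the library itself does.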

    I do not think it is a good thing that force_ascii defaults to True: I personally do not want this behaviour in 99% of cases, and most people using fuzzywuzzy are not even aware of it. Besides this, it appears to have a bug, since e.g.

    > utils.full_process("ā", force_ascii=True)
    'ā'
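
    even though "ā" is clearly not an ASCII character, so the call should return an empty string. That would also explain both of your results: the "ā" survives preprocessing, so after lowercasing both calls compare "māne" with "mane", which differ in one of four characters, giving 75.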

    In your case, where you want every difference between the two strings (including case) to count, you should call

    > fuzz.WRatio('Māne', 'mane', full_process=False)
    50
    > fuzz.WRatio('Māne', 'Mane', full_process=False)
    75
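
To see where 50 and 75 come from, here is a rough sketch that uses difflib.SequenceMatcher from the standard library as a stand-in for the plain ratio. It assumes that, for two short strings of equal length, the WRatio score is driven by that plain ratio; the helper name simple_ratio is hypothetical and not part of fuzzywuzzy.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Hypothetical helper: 2 * matches / total length, as an integer percentage."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


# No preprocessing: "M" vs "m" and "ā" vs "a" both differ, only "ne" matches.
print(simple_ratio("Māne", "mane"))  # 50  (2 * 2 / 8)

# No preprocessing: only "ā" vs "a" differs, "M" and "ne" match.
print(simple_ratio("Māne", "Mane"))  # 75  (2 * 3 / 8)

# After lowercasing with the "ā" left in place, both inputs compare like this.
print(simple_ratio("māne", "mane"))  # 75  (2 * 3 / 8)
```

With full_process=False the case difference costs one extra matching character, which is the difference between the two scores.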