Python FuzzyWuzzy unexpected mismatch between fuzz.ratio and process.extractOne results


I'm working on code that uses fuzzy string matching to match a dataframe of user inputs (a dataframe of lists of strings, after some cleaning) to specific words of interest. I use Python Pandas for handling dataframes and the FuzzyWuzzy package for matching strings. I do everything in Anaconda's Jupyter notebook.

The code works fine (it has approximately 90% matching accuracy), and I'm at the phase where I'm trying to find out why it gave false positives or false negatives in certain cases. The code only marks a match when the score from FuzzyWuzzy's process.extractOne() function is above 80 points.

However, I stumbled upon an odd problem: in one cell the tester's input was only ['x'], and it was still matched to 'minimax', a word of interest. That means its score must have been above 80, which it definitely shouldn't have been.

It seems that the modules fuzzywuzzy.fuzz and fuzzywuzzy.process yield different results.

This is what I expected; the score of fuzz.ratio() is low enough:

In [1]: fuzz.ratio('x', 'minimax')
Out [1]: 25

This is the call I actually use, and its result does not match the previous one:

In [2]: process.extractOne('minimax', ['x'])
Out [2]: ('x', 90)

I tested many variations of the code, and the problem still occurred no matter which argument of process.extractOne contained the 'x'. Changing the position of the x within the 'minimax' string (e.g. 'xminima', 'mixnima') didn't change the score either. The same happened when I used a different process function (e.g. process.extractBests()).

What could be the problem? Am I using the function or the package incorrectly? Keep in mind that in most cases my code worked properly.


Solution

  • Both process.extract and process.extractOne use fuzz.WRatio as the scorer by default. fuzz.WRatio computes its result from several scorers, each with its own weight. In your example the result comes from fuzz.partial_ratio, weighted by a factor of 0.9. Since x is a substring of minimax, fuzz.partial_ratio returns 100, so the final score is 100 * 0.9 = 90. This is also why moving the x to a different position in the string did not change the score.

    You can specify a different scorer in the following way:

    > process.extractOne('minimax', ['x'], scorer=fuzz.ratio)
    ('x', 25)
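To make the arithmetic above concrete, here is a minimal standard-library sketch of the idea. It is not FuzzyWuzzy's actual implementation: the helper names (simple_ratio, simple_partial_ratio, weighted_score) are hypothetical, the partial match is approximated by sliding the shorter string over the longer one, and only the 0.9 weight mentioned in the answer is modelled.

```python
from difflib import SequenceMatcher

# Hypothetical helpers for illustration only; these are NOT FuzzyWuzzy functions.

def simple_ratio(a, b):
    """Whole-string similarity on a 0..100 scale (difflib's 2*M/T measure)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simple_partial_ratio(a, b):
    """Best similarity between the shorter string and any same-length
    window of the longer string (a simplified partial match)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    n = len(shorter)
    return max(
        simple_ratio(shorter, longer[i:i + n])
        for i in range(len(longer) - n + 1)
    )

# Assumption: the 0.9 weight for the partial score, as cited in the answer.
PARTIAL_WEIGHT = 0.9

def weighted_score(a, b):
    """Take the better of the plain score and the down-weighted partial score."""
    return max(simple_ratio(a, b),
               round(PARTIAL_WEIGHT * simple_partial_ratio(a, b)))

print(simple_ratio('x', 'minimax'))          # whole-string comparison
print(simple_partial_ratio('x', 'minimax'))  # substring comparison
print(weighted_score('x', 'minimax'))        # combined, weighted result
```

In this sketch the whole-string score stays low because only one of eight characters matches, while the partial score is perfect because 'x' appears verbatim in 'minimax'. A one-character input will therefore clear an 80-point threshold against any word that contains it, unless a stricter scorer such as fuzz.ratio is passed explicitly.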