Search code examples
pythonpython-2.7setfuzzy-search

Python fuzzy element checking


I have a Python 2.7 set object that includes the names of data categories, and I would like to be able to do some form of fuzzy element checking to see if part of a user given input is an element of the set.

Here is a toy example, to explain what I would like. Given the following set and user input:

SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'

I would like the program to print out something like the following:

'yellow_ball' not found, did you mean 'red_ball', or 'green_ball'?

So far I have the following:

import re

SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'

# all members of my set are lowercase and separated by an underscore
user_input_list = user_input.lower().split() # for use in fuzzy search
user_input = "_".join(user_input_list) # convert to yellow_ball for element check
regex = None
matches = []

if user_input not in SET:
    # FUZZY ELEMENT CHECK
    for item in user_input_list:
        regex = re.compile(item)
        for element in SET:
            if regex.match(element):
                matches.append(element)

    if len(matches) > 0:
        print '\'%s\' not found, did you mean %s' % (user_input, ", ".join(['\'' + x + '\'' for x in matches]))
    else:
        print '\'%s\' not found.' % user_input

Is there a more efficient way of doing this, perhaps that uses third party libraries?

Thanks for your help, Geraint


Solution

  • If you're interested in 3rd party modules, there's a nice little module I like to use for this sort of thing called fuzzywuzzy, for fuzzy string matching in Python.

    This module performs fuzzy string matching with just a couple of lines of code.

    Here's an example of how you can use it:

    >>> from fuzzywuzzy import process
    >>> choices = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
    >>> query = 'yellow ball'
    

    We've set up our choices and input, now we can retrieve the closest matches.

    >>> process.extract(query, choices)
    [('red_ball', 53), ('green_ball', 48), ('red_cup', 13), ('green_cup', 10)]
    

    This returns all choices in descending order of closeness of match. The distance between strings is computed using the Levenshtein Distance metric. You can extract the top n items and propose them as valid alternatives if the original input isn't present in the choices set.

    If you want only the top match, just do this:

    >>> process.extractOne(query, choices)
    ('red_ball', 53)
    

    You can peruse more examples using fuzzy string matching here.