I have a Python 2.7 set object that includes the names of data categories, and I would like to be able to do some form of fuzzy element checking to see if part of a user given input is an element of the set.
Here is a toy example, to explain what I would like. Given the following set and user input:
SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'
I would like the program to print out something like the following:
'yellow_ball' not found, did you mean 'red_ball', or 'green_ball'?
So far I have the following:
import re
SET = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
user_input = 'yellow ball'
# all members of my set are lowercase and separated by an underscore
user_input_list = user_input.lower().split() # for use in fuzzy search
user_input = "_".join(user_input_list) # convert to yellow_ball for element check
regex = None
matches = []
if user_input not in SET:
# FUZZY ELEMENT CHECK
for item in user_input_list:
regex = re.compile(item)
for element in SET:
if regex.match(element):
matches.append(element)
if len(matches) > 0:
print '\'%s\' not found, did you mean %s' % (user_input, ", ".join(['\'' + x + '\'' for x in matches]))
else:
print '\'%s\' not found.' % user_input
Is there a more efficient way of doing this, perhaps that uses third party libraries?
Thanks for your help, Geraint
If you're interested in 3rd party modules, there's a nice little module I like to use for this sort of thing called fuzzywuzzy
, for fuzzy string matching in Python.
This module performs fuzzy string matching with just a couple of lines of code.
Here's an example of how you can use it:
>>> from fuzzywuzzy import process
>>> choices = {'red_ball', 'green_ball', 'red_cup', 'green_cup'}
>>> query = 'yellow ball'
We've set up our choices and input, now we can retrieve the closest matches.
>>> process.extract(query, choices)
[('red_ball', 53), ('green_ball', 48), ('red_cup', 13), ('green_cup', 10)]
This returns all choices in descending order of closeness of match. The distance between strings is computed using the Levenshtein Distance metric. You can extract the top n
items and propose them as valid alternatives if the original input isn't present in the choices set.
If you want only the top match, just do this:
>>> process.extractOne(query, choices)
('red_ball', 53)
You can peruse more examples using fuzzy string matching here.