I have a string I would like to match against a list of candidates. Here is an example:
# ignore case
string = "The Shining" # The Stanley Kubrick Movie
candidates = ['Shining', 'The shins', 'Shining, The']
most_similar(string, candidates)
==> 'Shining, The'
Doing a "literal string comparison", I usually use the Levenshtein distance or ratio in this case. However, I'd like to do a more sophisticated similarity test so that the best match in the above case is Shining, The
.
I'm guessing that this is a common issue that has probably been solved extensively, so I was wondering what library/tool/etc. might be the best way to get what I'm trying to do?
You're looking for the gensim or fuzzywuzzy package.
In this specific case, you're probably leaning towards fuzzywuzzy
since you are just trying to do a string match.
gensim
is more for calculating similarity scores and vector representations for documents, paragraphs, sentences, words, corpora, etc... with the goal of capturing semantic/topical meaning rather than literal string matching.
So in your case, using fuzzy string matching, you might do:
from fuzzywuzzy import fuzz
fuzz.partial_ratio('Shining', 'The shins')
>>> 50
fuzz.partial_ratio('Shining', 'Shining, The')
>>> 100
fuzz.partial_ratio('Shining', 'unrelated')
>>> 14
The partial_ratio
function is case sensitive, so you might want to lowercase all of your inputs. It'll output a score between 0 and 100 (100 being a very strong match). It's up to you how you filter out matches from there, maybe use a threshold: if score > 75: its a match
.
I would recommend looking into the different functions in the fuzzywuzzy
package, see what works best for you're case.