python python-3.x string search levenshtein-distance

Python string similarity (with complexity)

I have a string I would like to match against a list of candidates. Here is an example:

# ignore case
string = "The Shining" # The Stanley Kubrick Movie
candidates = ['Shining', 'The shins', 'Shining, The'] 
most_similar(string, candidates)
==> 'Shining, The'

Doing a "literal string comparison", I usually use the Levenshtein distance or ratio in this case. However, I'd like to do a more sophisticated similarity test so that the best match in the above case is Shining, The.

I'm guessing that this is a common issue that has probably been solved extensively, so I was wondering what library/tool/etc. might be the best way to get what I'm trying to do?

Solution

You're looking for the gensim or fuzzywuzzy package.

In this specific case, you're probably leaning towards fuzzywuzzy since you are just trying to do a string match.

gensim is more for calculating similarity scores and vector representations for documents, paragraphs, sentences, words, corpora, etc... with the goal of capturing semantic/topical meaning rather than literal string matching.

So in your case, using fuzzy string matching, you might do:

from fuzzywuzzy import fuzz

fuzz.partial_ratio('Shining', 'The shins')
>>> 50 

fuzz.partial_ratio('Shining', 'Shining, The')
>>> 100

fuzz.partial_ratio('Shining', 'unrelated')
>>> 14

The partial_ratio function is case sensitive, so you might want to lowercase all of your inputs. It'll output a score between 0 and 100 (100 being a very strong match). It's up to you how you filter out matches from there, maybe use a threshold: if score > 75: its a match.

I would recommend looking into the different functions in the fuzzywuzzy package, see what works best for you're case.