Search code examples
pythonpython-3.xstringsearchlevenshtein-distance

Python string similarity (with complexity)


I have a string I would like to match against a list of candidates. Here is an example:

# ignore case
string = "The Shining" # The Stanley Kubrick Movie
candidates = ['Shining', 'The shins', 'Shining, The'] 
most_similar(string, candidates)
==> 'Shining, The'

Doing a "literal string comparison", I usually use the Levenshtein distance or ratio in this case. However, I'd like to do a more sophisticated similarity test so that the best match in the above case is Shining, The.

I'm guessing that this is a common issue that has probably been solved extensively, so I was wondering what library/tool/etc. might be the best way to get what I'm trying to do?


Solution

  • You're looking for the gensim or fuzzywuzzy package.

    In this specific case, you're probably leaning towards fuzzywuzzy since you are just trying to do a string match.

    gensim is more for calculating similarity scores and vector representations for documents, paragraphs, sentences, words, corpora, etc... with the goal of capturing semantic/topical meaning rather than literal string matching.

    So in your case, using fuzzy string matching, you might do:

    from fuzzywuzzy import fuzz
    
    fuzz.partial_ratio('Shining', 'The shins')
    >>> 50 
    
    fuzz.partial_ratio('Shining', 'Shining, The')
    >>> 100
    
    fuzz.partial_ratio('Shining', 'unrelated')
    >>> 14
    

    The partial_ratio function is case sensitive, so you might want to lowercase all of your inputs. It'll output a score between 0 and 100 (100 being a very strong match). It's up to you how you filter out matches from there, maybe use a threshold: if score > 75: its a match.

    I would recommend looking into the different functions in the fuzzywuzzy package, see what works best for you're case.