
Get similarity percentage on multiple strings


Is there a function in Python that accepts multiple strings and returns a percentage indicating how similar they are? Something like SequenceMatcher, but for more than two strings.

For example, we have the following sentences:

Hello how are you?
Hi how are you?
hi how are you doing?
Hey how is your day?

I want to get a percentage based on how similar the sentences are to each other.

Let's say we have these three sentences:

Hello how are you?
Hello how are you?
Hello how are you?

Then we should get 100% similarity.

But if we have:

Hello how are you?
Hello how are you?
hola como estats?

Then we should get roughly 67% similarity.


Solution

  • You can use pandas to hold the results in a DataFrame, itertools.combinations to generate every pair of strings from your list, and difflib.SequenceMatcher to score each pair. The mean of the pairwise scores is the overall similarity:

    import itertools
    from difflib import SequenceMatcher
    
    import pandas as pd
    
    def similarity(a, b):
        # Ratio in [0, 1]; 1.0 means the two strings are identical
        return SequenceMatcher(a=a, b=b).ratio()
    
    strings = ['Hello how are you?', 'Hi how are you?', 'hi how are you doing?', 'Hey how is your day?']
    
    # One row per unordered pair of strings
    df = pd.DataFrame(list(itertools.combinations(strings, 2)), columns=['a', 'b'])
    df['similarity'] = df.apply(lambda row: similarity(row['a'], row['b']), axis=1)
    
    # Mean pairwise similarity; the original answer reports 0.68
    print(round(df['similarity'].mean(), 2))
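
If pandas is not otherwise needed, the same pairwise averaging can be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: the function name mean_pairwise_similarity is hypothetical, and returning 1.0 for fewer than two strings is an assumption made here to avoid averaging an empty list.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def mean_pairwise_similarity(strings):
    """Average SequenceMatcher ratio over every unordered pair of strings.

    Hypothetical helper for illustration; not part of any library.
    """
    pairs = list(combinations(strings, 2))
    if not pairs:
        # Assumption: with fewer than two strings there is nothing to
        # compare, so the input is treated as fully similar.
        return 1.0
    return mean(SequenceMatcher(a=a, b=b).ratio() for a, b in pairs)


identical = ['Hello how are you?'] * 3
print(f"{mean_pairwise_similarity(identical):.0%}")  # prints "100%"
```

Note that this metric averages over pairs, not over sentences. For three strings of which two are identical, the result is (1 + 2r) / 3, where r is the ratio between the two distinct strings, so it will not necessarily land on the 67% the question expects for that case.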