python string similarity sentence-similarity

Get similarity percentage on multiple strings

Is there a function in Python that accepts multiple strings and returns a percentage indicating how similar they are? Something like SequenceMatcher, but for more than two strings.

For example, we have the following sentences:

Hello how are you?
Hi how are you?
hi how are you doing?
Hey how is your day?

I want to get a percentage based on how similar the sentences are to each other.

Let's say we have these three sentences:

Hello how are you?
Hello how are you?
Hello how are you?

Then we should get 100% similarity.

But if we have

Hello how are you?
Hello how are you?
hola como estats?

then we should get a similarity of around 67%.

Solution

You can use itertools.combinations to generate every pair of strings from your list, difflib.SequenceMatcher to score each pair, and pandas to hold the results in a DataFrame. The mean of the pairwise scores is the overall similarity:

import pandas as pd
import itertools
from difflib import SequenceMatcher

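def similarity(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical
    return SequenceMatcher(a=a, b=b).ratio()

strings = ['Hello how are you?', 'Hi how are you?', 'hi how are you doing?', 'Hey how is your day?']
# Every unordered pair of strings: n * (n - 1) / 2 pairs
combinations = itertools.combinations(strings, 2)

# Columns 0 and 1 hold the two strings of each pair
df = pd.DataFrame(list(combinations))
df['similarity'] = df.apply(lambda x: similarity(x[0], x[1]), axis=1)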

# Mean of the pairwise ratios is the overall similarity score
df['similarity'].mean()
# roughly 0.69 for these four strings
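
If you do not need the DataFrame, the same pairwise averaging can be sketched with the standard library only. This is a minimal sketch, not a definitive implementation: the function name mean_pairwise_similarity is made up for illustration, and returning 1.0 for fewer than two strings is an assumption you may want to change.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def mean_pairwise_similarity(strings):
    """Average SequenceMatcher ratio over every unordered pair of strings."""
    if len(strings) < 2:
        # Assumption: fewer than two strings counts as fully similar
        return 1.0
    return mean(
        SequenceMatcher(a=a, b=b).ratio()
        for a, b in combinations(strings, 2)
    )


identical = ['Hello how are you?'] * 3
mixed = ['Hello how are you?', 'Hello how are you?', 'hola como estats?']

print(f"{mean_pairwise_similarity(identical):.0%}")  # 100%
print(f"{mean_pairwise_similarity(mixed):.0%}")
```

Note that for the second list only one of the three pairs is identical, so the score is at least 1/3 but depends on how many characters the differing sentence shares with the others. It is a pairwise character-level average, not the fraction of identical sentences, so it will not necessarily come out at 67%.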