Is there a function in Python that accepts multiple strings and returns a percentage indicating how similar they are? Something like SequenceMatcher,
but for more than two strings.
For example, we have the following sentences:
Hello how are you?
Hi how are you?
hi how are you doing?
Hey how is your day?
I want to get a percentage based on how similar the sentences are to each other.
Let's say we have these three sentences:
Hello how are you?
Hello how are you?
Hello how are you?
Then we should get 100% similarity.
But if we have
Hello how are you?
Hello how are you?
hola como estats?
then we should get a number around 67%.
You can use pandas to hold the results in a DataFrame, itertools.combinations to generate every pair of strings from your list, and difflib.SequenceMatcher for the similarity calculation:
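import pandas as pd
import itertools
from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical
    seq = SequenceMatcher(a=a, b=b)
    return seq.ratio()

strings = ['Hello how are you?', 'Hi how are you?', 'hi how are you doing?', 'Hey how is your day?']
# every unordered pair of strings: 6 pairs for 4 strings
combinations = itertools.combinations(strings, 2)
df = pd.DataFrame(list(combinations))
# columns 0 and 1 hold the two strings of each pair
df['similarity'] = df.apply(lambda x: similarity(x[0], x[1]), axis=1)
# overall score: the mean of all pairwise ratios
df.similarity.mean()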
0.68
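If you don't need the DataFrame, the same pairwise average can be sketched with the standard library alone. This is only a minimal variant of the approach above, not a built-in function: the name mean_pairwise_similarity and the choice to raise ValueError for fewer than two strings are my own assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def mean_pairwise_similarity(strings):
    """Average SequenceMatcher ratio over every unordered pair of strings.

    Hypothetical helper name; raising on fewer than two strings is an
    assumption, since there is no pair to compare in that case.
    """
    if len(strings) < 2:
        raise ValueError("need at least two strings to compare")
    return mean(
        SequenceMatcher(a=first, b=second).ratio()
        for first, second in combinations(strings, 2)
    )


identical = ["Hello how are you?"] * 3
print(mean_pairwise_similarity(identical))  # 1.0, every pair matches exactly
```

Note that with two identical sentences and one different one, only one of the three pairs is an exact match, so the score is (1 + 2r) / 3, where r is the ratio between the differing sentences. That is not necessarily the 67% from the question; it depends on how many characters the differing sentences happen to share.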