I am trying to determine the similarity of two columns in a pandas dataframe:
Text1 All
Performance results achieved by the approaches submitted to this Challenge. The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist. Where am I?
I would like to compare 'Performance results ... '
with 'The six...'
and 'Accuracy is one...'
with 'Where am I?'
.
The first row should have a higher similarity degree between the two columns as it includes some words; the second one should be equal to 0 as no words are in common between the two columns.
To compare the two columns I've used SequenceMatcher
as follows:
from difflib import SequenceMatcher
ratio = SequenceMatcher(None, df.Text1, df.All).ratio()
but it seems to be wrong the use of df.Text1, df.All
.
Can you tell me why?
SequenceMatcher
isn't designed for a pandas series..apply
the function.SequenceMatcher
Examples
isjunk=None
even spaces are not considered junk.isjunk=lambda y: y == " "
considers spaces as junk.from difflib import SequenceMatcher
import pandas as pd
data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}
df = pd.DataFrame(data)
# isjunk=lambda y: y == " "
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x[0], x[1]).ratio(), axis=1)
# display(df)
Text1 All ratio
0 Performance results achieved by the approaches submitted to this Challenge. The six top approaches and three others outperform the strong baseline. 0.356164
1 Accuracy is one of the basic principles of perfectionist. Where am I? 0.088235
# isjunk=None
df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x[0], x[1]).ratio(), axis=1)
# display(df)
Text1 All ratio
0 Performance results achieved by the approaches submitted to this Challenge. The six top approaches and three others outperform the strong baseline. 0.410959
1 Accuracy is one of the basic principles of perfectionist. Where am I? 0.117647