Comparing strings within two columns in pandas with SequenceMatcher


I am trying to determine the similarity of two columns in a pandas dataframe:

Text1                                                                         All
Performance results achieved by the approaches submitted to this Challenge.   The six top approaches and three others outperform the strong baseline.
Accuracy is one of the basic principles of perfectionist.                     Where am I?

I would like to compare 'Performance results ...' with 'The six ...' and 'Accuracy is one ...' with 'Where am I?'. The first row should get a higher similarity score because the two columns share some words; the second should get 0 because the two columns have no words in common.

To compare the two columns I've used SequenceMatcher as follows:

from difflib import SequenceMatcher

ratio = SequenceMatcher(None, df.Text1, df.All).ratio()

but passing df.Text1 and df.All this way seems to be wrong.

Can you tell me why?


Solution

    • SequenceMatcher isn't designed for a pandas Series; it expects the two sequences to compare (here, two strings), not two whole columns.
    • You can .apply the function row by row instead.
    • SequenceMatcher Examples
      • With isjunk=None, nothing is treated as junk, not even spaces.
      • With isjunk=lambda y: y == " ", spaces are treated as junk.
    from difflib import SequenceMatcher
    import pandas as pd
    
    data = {'Text1': ['Performance results achieved by the approaches submitted to this Challenge.', 'Accuracy is one of the basic principles of perfectionist.'],
            'All': ['The six top approaches and three others outperform the strong baseline.', 'Where am I?']}
    
    df = pd.DataFrame(data)
    
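    # isjunk=lambda y: y == " "
    # Access each row by column label; positional x[0] on a label-indexed row is deprecated in recent pandas.
    df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(lambda y: y == " ", x['Text1'], x['All']).ratio(), axis=1)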
    
    # display(df)
                                                                             Text1                                                                      All     ratio
    0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.356164
    1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.088235
    
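    # isjunk=None
    # Access each row by column label; positional x[0] on a label-indexed row is deprecated in recent pandas.
    df['ratio'] = df[['Text1', 'All']].apply(lambda x: SequenceMatcher(None, x['Text1'], x['All']).ratio(), axis=1)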
    
    # display(df)
                                                                             Text1                                                                      All     ratio
    0  Performance results achieved by the approaches submitted to this Challenge.  The six top approaches and three others outperform the strong baseline.  0.410959
    1                    Accuracy is one of the basic principles of perfectionist.                                                              Where am I?  0.117647
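
The ratios above compare the strings character by character, which is why the second row is not 0 even though the two texts share no words. If the goal is the word-level behaviour described in the question, one possible sketch is to pass lists of words to SequenceMatcher instead of raw strings. The helper name word_ratio, the lowercasing, and the plain whitespace split are assumptions made for illustration; they are not part of the original answer, and punctuation stays attached to the words.

```python
from difflib import SequenceMatcher


def word_ratio(a, b):
    """Similarity of two strings compared word by word (hypothetical helper)."""
    # Lowercase and split on whitespace so that whole words, not characters,
    # are the elements being matched. Punctuation is left attached to words.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


pairs = [
    ('Performance results achieved by the approaches submitted to this Challenge.',
     'The six top approaches and three others outperform the strong baseline.'),
    ('Accuracy is one of the basic principles of perfectionist.',
     'Where am I?'),
]

# One ratio per pair: above 0 for the first pair (shared words),
# exactly 0.0 for the second pair (no shared words).
ratios = [word_ratio(a, b) for a, b in pairs]
print(ratios)
```

With the DataFrame from the answer, the same helper could be used row by row, for example df.apply(lambda row: word_ratio(row['Text1'], row['All']), axis=1).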