python excel pandas data-analysis similarity

Sentence similarity using the Jaccard coefficient on an Excel file


I want to apply the Jaccard coefficient to an Excel file that has 5575 rows and two columns, 'id' and 'text'. Specifically, I want the similarity between each pair of consecutive rows. This is how the Excel file looks:


Solution

  • You could use the pandas shift function to shift the values of the text column by a desired number of periods (one in this case). Each row then sits next to the previous row's text, so you can apply jaccard_similarity to every pair of consecutive rows.

    import pandas as pd
    
    def jaccard_similarity(a, b):
        # set() on a string yields its characters, so this compares character sets
        a = set(a)
        b = set(b)
        # calculate Jaccard similarity: |a ∩ b| / |a ∪ b|
        return float(len(a.intersection(b))) / len(a.union(b))
    
    df = pd.read_excel('sample.xlsx', index_col='id')
    
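    # pair each row with the previous row's text; the first row gets ''
    df['shift'] = df['text'].shift().fillna('')
    # compute the similarity between each row and the row before it
    df['jaccard_sim'] = df.apply(lambda r: jaccard_similarity(r['text'], r['shift']), axis=1)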
    print(df)
    

    Output of print(df)

                                                     text                                              shift  jaccard_sim
    id
    1   When you have a dream, you’ve got to grab it a...                                                        0.000000
    2   Nothing is impossible. The word itself says ‘I...  When you have a dream, you’ve got to grab it a...     0.566667
    3   There is nothing impossible to they who will try.  Nothing is impossible. The word itself says ‘I...     0.692308
    4   The bad news is time flies. The good news is y...  There is nothing impossible to they who will try.     0.782609
    5   Life has got all those twists and turns. You’v...  The bad news is time flies. The good news is y...     0.730769
    ...
    ...
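Note that the function above calls set() on whole strings, so it compares sets of characters, which tends to give high scores for any two English sentences. If word-level similarity is what you are after, a minimal sketch could look like the following. The helper name word_jaccard, the lowercase whitespace tokenization, the 0.0 result for two empty texts, and the three sample rows are all assumptions made for illustration; they are not part of the original answer or of the asker's file.

```python
import pandas as pd


def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lowercase words in a and b."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    if not union:
        # Both texts are empty: return 0.0 by convention (an assumption)
        # instead of dividing by zero.
        return 0.0
    return len(words_a & words_b) / len(union)


# Small in-memory stand-in for the Excel file (made-up rows).
df = pd.DataFrame(
    {"id": [1, 2, 3], "text": ["the cat sat", "the cat ran", "a dog ran"]}
).set_index("id")

# Pair each row with the previous row's text; the first row gets ''.
df["shift"] = df["text"].shift().fillna("")

# Word-level similarity between each row and the row before it.
df["jaccard_words"] = [
    word_jaccard(text, prev) for text, prev in zip(df["text"], df["shift"])
]
print(df)
```

With real data you would replace the hand-built DataFrame with the pd.read_excel call from the answer. Punctuation stays attached to words under this simple split, so "try." and "try" count as different tokens unless you strip punctuation first.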