python excel pandas data-analysis similarity

Sentence similarity using the Jaccard coefficient on an Excel file

I want to apply the Jaccard coefficient to an Excel file that has 5575 rows and two columns, 'id' and 'text'. Specifically, I want the similarity computed pairwise, two rows at a time.

Solution

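You can use the pandas shift function to shift the 'text' column down by a given number of periods (one in this case). Each row then sits next to the text of the row before it, so jaccard_similarity can be applied to every pair of consecutive rows. The first row has no predecessor, so its shifted value is filled with an empty string.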

import pandas as pd

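def jaccard_similarity(a, b):
    # set() on a string yields its characters, so this compares character sets
    a = set(a)
    b = set(b)
    union = a.union(b)
    # avoid division by zero when both inputs are empty
    if not union:
        return 0.0
    # calculate Jaccard similarity: |intersection| / |union|
    return len(a.intersection(b)) / len(union)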

df = pd.read_excel('sample.xlsx', index_col='id')

df['shift'] = df['text'].shift().fillna('')
df['jaccard_sim'] = df.apply(lambda r: jaccard_similarity(r['text'], r['shift']), axis=1)
print(df)

Output from df

                                                 text                                              shift  jaccard_sim
id
1   When you have a dream, you’ve got to grab it a...                                                        0.000000
2   Nothing is impossible. The word itself says ‘I...  When you have a dream, you’ve got to grab it a...     0.566667
3   There is nothing impossible to they who will try.  Nothing is impossible. The word itself says ‘I...     0.692308
4   The bad news is time flies. The good news is y...  There is nothing impossible to they who will try.     0.782609
5   Life has got all those twists and turns. You’v...  The bad news is time flies. The good news is y...     0.730769
...
...
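
Because set() on a string produces a set of characters, the scores above measure character overlap. If word-level similarity is what you are after, one possible variation is sketched below. The function name word_jaccard, the regex tokenizer, and the small in-memory DataFrame are assumptions made for illustration; they are not part of the original answer or of your file.

```python
import re

import pandas as pd


def word_jaccard(a, b):
    """Jaccard similarity over the lowercase word sets of two strings."""
    # Simple tokenizer (an assumption): runs of word characters, case-folded
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    if not union:
        # Both strings contain no words; define the similarity as 0.0
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical stand-in for the Excel data, so the sketch runs without a file
df = pd.DataFrame(
    {"text": ["the cat sat on the mat", "the cat sat on the sofa", "dogs bark loudly"]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Same shift approach as above: pair each row with the previous row's text
df["shift"] = df["text"].shift().fillna("")
df["word_jaccard"] = [word_jaccard(t, s) for t, s in zip(df["text"], df["shift"])]
print(df[["word_jaccard"]])
```

Here rows 1 and 2 share four of six distinct words, so row 2 scores about 0.67, while rows 1 and 3 score 0.0. Iterating with zip over the two columns avoids the per-row overhead of df.apply with axis=1, which may matter on a few thousand rows.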