I want to apply the Jaccard coefficient to an Excel file that has 5575 rows and two columns, 'id' and 'text'. Specifically, I want the similarity computed on the rows two by two:
You could use the Pandas shift function to shift the values of the text column by a desired number of periods (one in this case) while leaving the index in place. Each row then sits next to the text of the row before it, which lets you apply jaccard_similarity to the text column over consecutive pairs of rows.
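import pandas as pd

def jaccard_similarity(a, b):
    # set() over a string yields its unique characters, so this compares
    # character sets; split the strings first to compare words instead.
    a = set(a)
    b = set(b)
    union = a.union(b)
    if not union:
        return 0.0  # both inputs empty: avoid division by zero
    # calculate Jaccard similarity: |A ∩ B| / |A ∪ B|
    return len(a.intersection(b)) / len(union)

df = pd.read_excel('sample.xlsx', index_col='id')
# Previous row's text; the first row has no predecessor, so use ''
df['shift'] = df['text'].shift().fillna('')
df['jaccard_sim'] = df.apply(lambda r: jaccard_similarity(r['text'], r['shift']), axis=1)
print(df)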
Output of print(df):

                                                 text                                              shift  jaccard_sim
id
1   When you have a dream, you’ve got to grab it a...                                                        0.000000
2   Nothing is impossible. The word itself says ‘I...  When you have a dream, you’ve got to grab it a...     0.566667
3   There is nothing impossible to they who will try.  Nothing is impossible. The word itself says ‘I...     0.692308
4   The bad news is time flies. The good news is y...  There is nothing impossible to they who will try.     0.782609
5   Life has got all those twists and turns. You’v...  The bad news is time flies. The good news is y...     0.730769
...
...
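
Note that calling set() on a string gives the set of its characters, so the scores above are character-level similarities. If you want word-level similarity instead, a minimal sketch could look like the following. It assumes lowercase, whitespace-separated tokens are good enough for your data; the word_jaccard helper and the small in-memory DataFrame are hypothetical stand-ins for illustration (swap in pd.read_excel for the real file).

```python
import pandas as pd


def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase, whitespace-separated tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # both texts empty: define the similarity as 0
    return len(tokens_a & tokens_b) / len(union)


# Hypothetical sample data standing in for the Excel file.
df = pd.DataFrame(
    {"text": ["the cat sat", "the cat ran", "a dog barked"]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Text of the previous row; the first row has no predecessor, so use ''.
prev = df["text"].shift().fillna("")
df["jaccard_sim"] = [word_jaccard(t, p) for t, p in zip(df["text"], prev)]
print(df)
```

Here the second row shares two of four distinct words with the first, so its score is 0.5, while the first and third rows score 0.0. Punctuation stays attached to the words with this simple split, so you may want to strip it beforehand.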