Tags: python, pandas, similarity

pandas: calculate Jaccard similarity for every row based on the value in another column


I have a dataframe as follows, only with more rows:

import pandas as pd

data = {'First': ['First value', 'Second value', 'Third value'],
        'Second': [['old', 'new', 'gold', 'door'],
                   ['old', 'view', 'bold', 'door'],
                   ['new', 'view', 'world', 'window']]}

df = pd.DataFrame(data, columns=['First', 'Second'])

To calculate the Jaccard similarity, I found this piece online (not my solution):

def lexical_overlap(doc1, doc2):
    # Work on sets so repeated words count only once
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)

    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)

    # Jaccard similarity, expressed as a percentage
    return float(len(intersection)) / len(union) * 100
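As a quick sanity check of what this function returns, here is a minimal sketch that applies it to the first two lists from the dataframe above. The function is repeated so the snippet runs on its own; the names `a`, `b` and `score` are only for illustration.

```python
def lexical_overlap(doc1, doc2):
    words_doc1 = set(doc1)
    words_doc2 = set(doc2)
    intersection = words_doc1.intersection(words_doc2)
    union = words_doc1.union(words_doc2)
    return float(len(intersection)) / len(union) * 100

a = ['old', 'new', 'gold', 'door']
b = ['old', 'view', 'bold', 'door']

# Shared words: {'old', 'door'} -> 2; distinct words across both lists: 6
score = lexical_overlap(a, b)
print(round(score, 2))  # 33.33
```

So the measure is the share of distinct words that appear in both lists, scaled to 0-100.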

What I would like is for the measure to treat each row of the Second column as a doc, compare every pair of rows, and output the score together with the row names from the First column, something like this:

First value and Second value = 80 

First value and Third value  = 95

Second value and Third value = 90

Solution

  • Well, I'd do it somewhat like this:

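    from itertools import combinations

    # Visit every unordered pair of row positions exactly once
    for i, j in combinations(range(len(df)), 2):
        firstlist = df.iloc[i, 1]   # word list from column 'Second', row i
        secondlist = df.iloc[j, 1]  # word list from column 'Second', row j

        value = round(lexical_overlap(firstlist, secondlist), 2)

        # Column 0 ('First') holds the row labels
        print(f"{df.iloc[i, 0]} and {df.iloc[j, 0]}'s value is: {value}")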
    

    Output:

    First value and Second value's value is: 33.33
    First value and Third value's value is: 14.29
    Second value and Third value's value is: 14.29
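If the scores are needed for further processing rather than just printed, one possible variation is to collect them into a second dataframe. The sketch below is self-contained and makes a few assumptions that are not in the question: the column names `left`, `right` and `jaccard_pct` are made up for illustration, and two empty lists are treated as 0% overlap to avoid a division by zero.

```python
from itertools import combinations

import pandas as pd


def lexical_overlap(doc1, doc2):
    """Jaccard similarity of two word lists, as a percentage."""
    words_doc1, words_doc2 = set(doc1), set(doc2)
    union = words_doc1 | words_doc2
    if not union:  # assumption: two empty lists count as 0% overlap
        return 0.0
    return len(words_doc1 & words_doc2) / len(union) * 100


df = pd.DataFrame({
    'First': ['First value', 'Second value', 'Third value'],
    'Second': [['old', 'new', 'gold', 'door'],
               ['old', 'view', 'bold', 'door'],
               ['new', 'view', 'world', 'window']],
})

# zip pairs each label with its word list; combinations yields
# every unordered pair of rows exactly once
rows = [
    {'left': name_a,
     'right': name_b,
     'jaccard_pct': round(lexical_overlap(words_a, words_b), 2)}
    for (name_a, words_a), (name_b, words_b)
    in combinations(zip(df['First'], df['Second']), 2)
]

pairs = pd.DataFrame(rows)  # hypothetical result table, one row per pair
print(pairs)
```

With n rows this produces n * (n - 1) / 2 pairs, so for large dataframes the pairwise comparison grows quadratically.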