
Return 'similar score' based on two dictionaries' similarity in Python?


I know it's possible to return how similar two strings are by using the following function:

from difflib import SequenceMatcher
def similar(a, b):
    output=SequenceMatcher(None, a, b).ratio()
    return output

In [37]: similar("Hey, this is a test!","Hey, man, this is a test, man.")
Out[37]: 0.76
In [38]: similar("This should be one.","This should be one.")
Out[38]: 1.0

But is it possible to score two dictionaries based on the similarity of their keys and the corresponding values? I don't want the number of keys they have in common, or a list of what they share, but a score from 0 to 1, like the string example above.

I'm trying to find the similarity score between ratings['Shane'] and ratings['Joe'] in this dictionary:

ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

I am using Python 2.7.10


Solution

    import math

    ratings={'Shane': {'127 Hours': 3.0, 'Avatar': 4.0, 'Nonstop': 5.0}, 'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}

    def cosine_similarity(vec1,vec2):
        # Pairs the values by position, not by movie title, and only reads
        # the first len(vec1) entries of vec2.
        sum11, sum12, sum22 = 0, 0, 0
        for i in range(len(vec1)):
            x = vec1[i]; y = vec2[i]
            sum11 += x*x
            sum22 += y*y
            sum12 += x*y
        return sum12/math.sqrt(sum11*sum22)

    # Before Python 3.7 the order of values() is not guaranteed,
    # so the pairing (and the score) can differ between runs.
    list1 = list(ratings['Shane'].values())
    list2 = list(ratings['Joe'].values())

    sim = cosine_similarity(list1,list2)
    print(sim)
    

    Output: 0.9205746178983233
    

    Update: when I use:

    ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
             'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0}}
    

    Output: 0.9574271077563381

    Update 2: aligned the two vectors by key, so both have the same length

    from math import sqrt


    ratings={'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
             'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
             'Bob': {'Panic Room':5.0,'Nonstop':5.0}}


    def norm(vector):
        # Euclidean length of a vector
        return sqrt(sum(a*a for a in vector))

    def cosine_similarity(x,y):
        # Use every key that appears in either dictionary, so both vectors
        # have the same length and list the movies in the same order.
        keys = sorted(set(x) | set(y))

        # A rating that is missing from one dictionary is treated as 0.
        vector1 = [float(x.get(k, 0)) for k in keys]
        vector2 = [float(y.get(k, 0)) for k in keys]

        numerator = sum(a*b for a,b in zip(vector1,vector2))
        denominator = norm(vector1)*norm(vector2)
        if denominator == 0:
            return 0.0    # an empty or all-zero dictionary has no direction
        return round(numerator/denominator,3)


    print("Similarity between Shane and Joe")
    print(cosine_similarity(ratings['Shane'],ratings['Joe']))

    print("Similarity between Joe and Bob")
    print(cosine_similarity(ratings['Joe'],ratings['Bob']))

    print("Similarity between Shane and Bob")
    print(cosine_similarity(ratings['Shane'],ratings['Bob']))


    Output:

    Similarity between Shane and Joe
    0.853
    Similarity between Joe and Bob
    0.245
    Similarity between Shane and Bob
    0.435
    

    A good explanation of the difference between Jaccard and cosine similarity: https://datascience.stackexchange.com/questions/5121/applications-and-differences-for-jaccard-similarity-and-cosine-similarity
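
To make the contrast with the Jaccard measure concrete, here is a minimal sketch, assuming you only care about which titles two people rated and not about the ratings themselves. The name jaccard_similarity is my own choice and is not taken from the linked discussion; scoring two empty dictionaries as 0 is also an assumption.

```python
ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}


def jaccard_similarity(x, y):
    """Number of shared keys divided by number of distinct keys; values are ignored."""
    keys_x, keys_y = set(x), set(y)
    union = keys_x | keys_y
    if not union:
        return 0.0  # assumption: two empty dictionaries score 0
    return len(keys_x & keys_y) / float(len(union))


print(jaccard_similarity(ratings['Shane'], ratings['Joe']))  # 3 shared of 4 titles -> 0.75
print(jaccard_similarity(ratings['Shane'], ratings['Bob']))  # 1 shared of 4 titles -> 0.25
```

Because the values never enter the calculation, two people who rated the same films with opposite opinions still score 1.0 here, which is why cosine similarity is probably the better fit for the ratings in the question.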

    I am using Python 3.4.

    NOTE: I assigned 0 to missing values, but you can substitute more suitable values instead. See: http://www.analyticsvidhya.com/blog/2015/02/7-steps-data-exploration-preparation-building-model-part-2/
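
As one alternative to filling the gaps with 0, here is a sketch that scores only the titles both people have rated, so a missing rating neither helps nor hurts. This is my own illustration and is not taken from the linked article: cosine_on_shared_keys is a made-up name, and returning 0.0 when nothing is shared is an assumption.

```python
from math import sqrt

ratings = {'Shane': {'127 Hours': 5.0, 'Avatar': 4.0, 'Nonstop': 5.0},
           'Joe': {'127 Hours': 5.0, 'Taken 3': 4.0, 'Avatar': 5.0, 'Nonstop': 3.0},
           'Bob': {'Panic Room': 5.0, 'Nonstop': 5.0}}


def cosine_on_shared_keys(x, y):
    """Cosine similarity computed only over keys present in both dictionaries."""
    shared = sorted(set(x) & set(y))
    if not shared:
        return 0.0  # assumption: nothing in common means no similarity
    vector1 = [float(x[k]) for k in shared]
    vector2 = [float(y[k]) for k in shared]
    numerator = sum(a * b for a, b in zip(vector1, vector2))
    denominator = sqrt(sum(a * a for a in vector1)) * sqrt(sum(b * b for b in vector2))
    if denominator == 0:
        return 0.0  # every shared rating is zero on one side
    return round(numerator / denominator, 3)


print(cosine_on_shared_keys(ratings['Shane'], ratings['Joe']))  # 0.962
print(cosine_on_shared_keys(ratings['Joe'], ratings['Bob']))    # 1.0
```

Note the second result: when two people share only one title, the score is always 1.0 for positive ratings, so this variant is probably only useful when the overlap is reasonably large.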