python scikit-learn cosine-similarity trigonometry

cosine similarity of documents with weights


I am trying to find the cosine similarity of two documents represented as follows:

d1: [(0,1), (3,2), (6, 1)]
d2: [(1,1), (3,1), (5,4), (6,2)]

Each document is a topic-weight vector: the first element of each tuple is the topic index and the second element is that topic's weight.

I am not sure how to calculate cosine similarity with this weighting scheme. Is there a Python module or package that would let me do this?


Solution

  • A very simple idea is to build a dense vector of weights for each document (with zeros for topics the document does not contain), and then use scipy.spatial.distance.cosine to compute the cosine distance, which is equal to 1 - similarity:

    In [1]: from scipy.spatial.distance import cosine
    In [2]: import numpy as np
    In [3]: d1 = [(0,1), (3,2), (6, 1)]
    In [4]: d2 = [(1,1), (3,1), (5,4), (6,2)]
    In [5]: N = max(i for i, _ in d1 + d2) + 1  # vector length: largest topic index + 1
    In [6]: def get_weights(d):
       ...:     w = [ 0. ] * N           # topics absent from d keep weight 0
       ...:     for i, weight in d:
       ...:         w[i] = weight
       ...:     return np.array(w)
       ...: 
    
    In [7]: w1 = get_weights(d1)
    In [8]: w2 = get_weights(d2)
    In [9]: 1-cosine(w1, w2)
    Out[9]: 0.3481553119113957
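
  • If the number of topics is large, a dense vector per document may be wasteful. As a rough sketch (not part of the original answer), the same similarity can be computed directly from the (topic, weight) pairs using only the standard library. The function name sparse_cosine_similarity is hypothetical, and returning 0.0 for an empty document is an assumption made here to avoid dividing by zero:

```python
import math


def sparse_cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as (index, weight) pairs."""
    wa, wb = dict(a), dict(b)
    # Only topics present in both documents contribute to the dot product.
    dot = sum(w * wb[i] for i, w in wa.items() if i in wb)
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a document with no weights is treated as similarity 0.
        return 0.0
    return dot / (norm_a * norm_b)


d1 = [(0, 1), (3, 2), (6, 1)]
d2 = [(1, 1), (3, 1), (5, 4), (6, 2)]
similarity = sparse_cosine_similarity(d1, d2)
print(similarity)
```

    For d1 and d2 only topics 3 and 6 overlap, so the dot product is 2*1 + 1*2 = 4 and the result is 4 / (sqrt(6) * sqrt(22)), which should agree with the scipy value above up to floating-point rounding.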