I am trying to find the cosine similarity of two documents represented as follows:
d1: [(0,1), (3,2), (6, 1)]
d2: [(1,1), (3,1), (5,4), (6,2)]
where each document is a topic-weight vector: the first element of each tuple is the topic index and the second element is that topic's weight.
I am not sure how to calculate cosine similarity with this weighted scheme. Is there a Python module or package that would let me do this?
A very simple idea is to expand each document into a dense vector of weights (with 0 for every topic the document does not contain), and then use scipy.spatial.distance.cosine
to compute the cosine distance (which is equal to 1 - similarity):
In [1]: from scipy.spatial.distance import cosine
In [2]: import numpy as np
In [3]: d1 = [(0,1), (3,2), (6, 1)]
In [4]: d2 = [(1,1), (3,1), (5,4), (6,2)]
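In [5]: N = 1 + max(i for i, _ in d1 + d2)  # vector length: highest topic index + 1 (7 here)
In [6]: def get_weights(d):
   ...:     w = [0.] * N            # topics missing from d keep weight 0
   ...:     for i, weight in d:
   ...:         w[i] = weight
   ...:     return np.array(w)
   ...:
In [7]: w1 = get_weights(d1)
In [8]: w2 = get_weights(d2)
In [9]: 1 - cosine(w1, w2)
Out[9]: 0.3481553119113957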
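If the topic space is large, building dense vectors may be wasteful. As a rough sketch of an alternative, the similarity can be computed directly from the (topic, weight) pairs using only the standard library. The function name sparse_cosine is made up for this illustration, and returning 0.0 for an empty document is an assumption, not something the question specifies:

```python
import math

def sparse_cosine(d1, d2):
    """Cosine similarity of two sparse vectors given as (topic, weight) pairs.

    Hypothetical helper for illustration; assumes each topic index
    appears at most once per document.
    """
    a, b = dict(d1), dict(d2)
    # Only topics present in both documents contribute to the dot product.
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero document is similar to nothing
    return dot / (norm_a * norm_b)

d1 = [(0, 1), (3, 2), (6, 1)]
d2 = [(1, 1), (3, 1), (5, 4), (6, 2)]
sim = sparse_cosine(d1, d2)
```

For the two documents above, the shared topics are 3 and 6, so the dot product is 2*1 + 1*2 = 4, the norms are sqrt(6) and sqrt(22), and the similarity is 4 / sqrt(132), roughly 0.348.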