I have numeric datasets of length 22, where each value lies between 0 and 1 and represents the fraction of that attribute.
[0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
[0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]
[0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0]
[0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0]
[0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0]
How can I calculate the cosine similarity between two such datasets using Python?
By the definition of cosine similarity, you just need to compute the normalized dot product of the two vectors a and b, that is, their dot product divided by the product of their Euclidean norms:
import numpy as np
a = [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0]
b = [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0]
# Dot product divided by the product of the Euclidean norms (Python 3 print)
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
Output (approximately):
0.115081383219
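If you want the similarity between every pair of your datasets rather than just two, one possible approach is to scale each row to unit length and multiply the resulting matrix by its transpose. The sketch below is only an illustration: the helper name `cosine_similarity_matrix` is made up for this example, and it assumes that no dataset is all zeros, since the cosine similarity is undefined for a zero vector.

```python
import numpy as np


def cosine_similarity_matrix(rows):
    """Return the matrix of pairwise cosine similarities between rows.

    Hypothetical helper for illustration; not part of NumPy.
    """
    X = np.asarray(rows, dtype=float)
    # Euclidean norm of each row
    norms = np.linalg.norm(X, axis=1)
    if np.any(norms == 0):
        # Assumption: zero vectors are rejected rather than given a default value
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    # Scale every row to unit length; dot products of unit vectors are cosines
    unit = X / norms[:, np.newaxis]
    return unit @ unit.T


# The five datasets from the question
datasets = [
    [0.03, 0.15, 0.58, 0.1, 0, 0, 0.05, 0, 0, 0.07, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.01, 0],
    [0.9, 0, 0.06, 0.02, 0, 0, 0, 0, 0.02, 0, 0, 0.01, 0, 0, 0, 0, 0.01, 0, 0, 0, 0, 0],
    [0.01, 0.07, 0.59, 0.2, 0, 0, 0, 0, 0, 0.05, 0, 0, 0, 0, 0, 0, 0.07, 0, 0, 0, 0, 0],
    [0.55, 0.12, 0.26, 0.01, 0, 0, 0, 0.01, 0.02, 0, 0, 0.01, 0, 0, 0.01, 0, 0.01, 0, 0, 0, 0, 0],
    [0, 0.46, 0.43, 0.05, 0, 0, 0, 0, 0, 0, 0, 0.02, 0, 0, 0, 0, 0.02, 0.02, 0, 0, 0, 0],
]

sim = cosine_similarity_matrix(datasets)
# sim[i, j] is the cosine similarity between dataset i and dataset j
print(np.round(sim, 3))
```

Here `sim[0, 1]` corresponds to the single value computed above for a and b, and the diagonal is 1 because every vector is perfectly similar to itself.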