For example, I have two 2D array as follow:
X = array([[4, 4, 4, 2],
[3, 1, 2, 2],
[1, 3, 3, 3],
[1, 3, 1, 2]])
Y = array([[2, 1, 1, 4],
[2, 1, 1, 1],
[4, 1, 4, 4],
[4, 2, 3, 4]])
I want to calculate cosine simarity between rows of X and Y. such as
def cos(feats1, feats2):
"""
Computing cosine distance
For similarity
"""
cos = np.dot(feats1, feats2) / (np.linalg.norm(feats1) * np.linalg.norm(feats2))
return cos
for i in range(a.shape[0]):
print(cos(a[i,:],b[i,:]))
Right now, I am using for loop to calculate cos distance between vectors. But the size of X and Y is like (1200000000, 512), it takes realy long time to calculate just using for loop.
My question is how i can utilize the power of algebra and numpy to speed up this process.
Or any other method that can perform this calculation more efficient.
Thanks
Possible in one single line: the trick is to just specify the axis over which perform the norm and the dot product.
X = np.random.randn(3,2)
Y = np.random.randn(3,2)
(X * Y).sum(axis=1) / np.linalg.norm(X, axis=1) / np.linalg.norm(Y, axis=1)
The first part, (X * Y).sum(axis=1)
takes care of computing the dot product. axis=1
specify that we perform the dot product over the columns, i.e. get a result for each row (the datapoints).
The second part simply computes the norm of each vector, with the same method.