I have a TF-IDF matrix of shape (149,1001). What is want is to compute the cosine similarity of last columns, with all columns
Here is what I did
from numpy import dot
from numpy.linalg import norm
for i in range(mat.shape[1]-1):
cos_sim = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
cos_sim
But this loop is making it slow. So, is there any efficient way? I want to do with numpy only
Leverage 2D
vectorized matrix-multiplication
Here's one with NumPy using matrix-multiplication on 2D data -
p1 = mat[:,-1].dot(mat[:,:-1])
p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
out1 = p1/p2
Explanation : p1
is the vectorized equivalent of looping of dot(mat[:,i], mat[:,-1])
. p2
is of (norm(mat[:,i])*norm(mat[:,-1]))
.
Sample run for verification -
In [57]: np.random.seed(0)
...: mat = np.random.rand(149,1001)
In [58]: out = np.empty(mat.shape[1]-1)
...: for i in range(mat.shape[1]-1):
...: out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
In [59]: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
...: out1 = p1/p2
In [60]: np.allclose(out, out1)
Out[60]: True
Timings -
In [61]: %%timeit
...: out = np.empty(mat.shape[1]-1)
...: for i in range(mat.shape[1]-1):
...: out[i] = dot(mat[:,i], mat[:,-1])/(norm(mat[:,i])*norm(mat[:,-1]))
18.5 ms ± 977 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: %%timeit
...: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = norm(mat[:,:-1],axis=0)*norm(mat[:,-1])
...: out1 = p1/p2
939 µs ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# @yatu's soln
In [89]: a = mat
In [90]: %timeit cosine_similarity(a[None,:,-1] , a.T[:-1])
2.47 ms ± 461 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Further optimize on norm
with einsum
Alternatively, we could compute p2
with np.einsum
.
So, norm(mat[:,:-1],axis=0)
could be replaced by :
np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))
Hence, giving us a modified p2
:
p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
Timings on same setup as earlier -
In [82]: %%timeit
...: p1 = mat[:,-1].dot(mat[:,:-1])
...: p2 = np.sqrt(np.einsum('ij,ij->j',mat[:,:-1],mat[:,:-1]))*norm(mat[:,-1])
...: out1 = p1/p2
607 µs ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
30x+
speedup over loopy one!