Hope everyone's well. I'm trying to use the following method to efficiently calculate the cosine similarity of a (29805, 40) sparse matrix, created by running scikit-learn's HashingVectorizer on my dataset. The method below is originally from @Waylon Flinn's answer to this question.
import numpy as np

def cosine_sim(A):
    similarity = np.dot(A, A.T)
    # squared magnitude of preference vectors (number of occurrences)
    square_mag = np.diag(similarity)
    # inverse squared magnitude
    inv_square_mag = 1 / square_mag
    # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
    inv_square_mag[np.isinf(inv_square_mag)] = 0
    # inverse of the magnitude
    inv_mag = np.sqrt(inv_square_mag)
    # cosine similarity (elementwise multiply by inverse magnitudes)
    cosine = similarity * inv_mag
    return cosine.T * inv_mag
When I try with a dummy matrix, everything works fine.
A = np.random.randint(0, 2, (10000, 100)).astype(float)
cos_sim = cosine_sim(A)
But when I try it with my own matrix:
cos_sim = cosine_sim(sparse_matrix)
I get
ValueError: Input must be 1- or 2-d.
Now, calling .shape on my matrix returns (29805, 40). How is that not 2-d? Can someone tell me what I'm doing wrong here? The error occurs here (from jupyter notebook traceback):
----> 6 square_mag = np.diag(similarity)
Thanks for reading! For context, evaluating sparse_matrix in the notebook displays this:
<29805x40 sparse matrix of type '<class 'numpy.float64'>'
with 1091384 stored elements in Compressed Sparse Row format>
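np.diag starts with
v = asanyarray(v)
similarity = np.dot(A, A.T) works when A is sparse, because it delegates the operation to the sparse matrix multiplication. The result is itself a sparse matrix; you can check that yourself.
But then try passing that result to np.asanyarray. It does not convert a sparse matrix into a 2-d array; it wraps the whole object in a 0-d object array, which is what np.diag rejects with "Input must be 1- or 2-d."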
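One way around this is to stay within SciPy's sparse operations and never hand the sparse product to np.diag. The sketch below is not the method from the linked answer: cosine_sim_sparse is a hypothetical helper name, and the small random matrix is made-up data standing in for the HashingVectorizer output. It normalises each row to unit length first, so a single sparse product yields the cosine similarities.

```python
import numpy as np
from scipy import sparse

def cosine_sim_sparse(A):
    """Row-wise cosine similarity of a SciPy sparse matrix (hypothetical helper)."""
    A = sparse.csr_matrix(A, dtype=float)
    # squared magnitude of each row, flattened to a plain 1-d ndarray
    square_mag = np.asarray(A.multiply(A).sum(axis=1)).ravel()
    # inverse magnitude; all-zero rows get 0 instead of inf
    inv_mag = np.zeros_like(square_mag)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])
    # scale every row to unit length, then one sparse product gives the cosines
    A_unit = sparse.diags(inv_mag) @ A
    return A_unit @ A_unit.T

# Made-up stand-in for the real data: a small binary matrix stored as CSR.
dense = np.random.default_rng(0).integers(0, 2, (200, 40)).astype(float)
A = sparse.csr_matrix(dense)

similarity = A @ A.T
print(sparse.issparse(similarity))        # the product is still a sparse matrix
print(np.asanyarray(similarity).ndim)     # dimensions np.diag would see after wrapping

cos_sim = cosine_sim_sparse(A)
print(cos_sim.shape)
```

The result is still a sparse matrix; call .toarray() on it only if a dense array is really needed, since a 29805-row input produces 29805 x 29805 similarities, roughly 7 GB as float64. Converting the input with sparse_matrix.toarray() before calling the original cosine_sim should also avoid the error, because a (29805, 40) dense array is small. scikit-learn's sklearn.metrics.pairwise.cosine_similarity accepts sparse input as well and may be simpler if that dependency is already in use.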