Tags: python, numpy, nlp, linear-algebra, cosine-similarity

Calculating cosine similarity: ValueError: Input must be 1- or 2-d


Hope everyone's well. I'm trying to use the following method to efficiently calculate the cosine similarity of a (29805, 40) sparse matrix, created by running scikit-learn's HashingVectorizer on my dataset. The method below is originally from @Waylon Flinn's answer to this question.

def cosine_sim(A):

    similarity = np.dot(A, A.T)

    # squared magnitude of preference vectors (number of occurrences)
    square_mag = np.diag(similarity)

    # inverse squared magnitude
    inv_square_mag = 1 / square_mag

    # if it doesn't occur, set its inverse magnitude to zero (instead of inf)
    inv_square_mag[np.isinf(inv_square_mag)] = 0

    # inverse of the magnitude
    inv_mag = np.sqrt(inv_square_mag)

    # cosine similarity (elementwise multiply by inverse magnitudes)
    cosine = similarity * inv_mag
    return cosine.T * inv_mag

When I try with a dummy matrix, everything works fine.

A = np.random.randint(0, 2, (10000, 100)).astype(float)
cos_sim = cosine_sim(A)

But when I try it with my own matrix:

cos_sim = cosine_sim(sparse_matrix)

I get

ValueError: Input must be 1- or 2-d.

Now, calling .shape on my matrix returns (29805, 40). How is that not 2-d? Can someone tell me what I'm doing wrong here? The error occurs here (from the Jupyter notebook traceback):

----> 6     square_mag = np.diag(similarity)

Thanks for reading! For context, evaluating sparse_matrix displays this:

<29805x40 sparse matrix of type '<class 'numpy.float64'>'
with 1091384 stored elements in Compressed Sparse Row format> 

Solution

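  • np.diag starts with

     v = asanyarray(v)

    similarity = np.dot(A, A.T) works when A is sparse, because it delegates the work to sparse matrix multiplication. The result is itself a sparse matrix; you can check that yourself.

    But then try passing that result to np.asanyarray. NumPy does not know how to convert a SciPy sparse matrix, so it wraps the whole object in a 0-d object array instead of producing a 2-d array, and np.diag rejects it with "Input must be 1- or 2-d." The (29805, 40) shape you see belongs to the sparse matrix, not to the array that np.diag ends up inspecting.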
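One possible way around this is to stay within scipy.sparse for every step. The sketch below is not part of the original answer: `cosine_sim_sparse` is a hypothetical name, and the small random `demo` matrix only stands in for the real (29805, 40) data. It reads the diagonal with the sparse matrix's own `.diagonal()` method instead of np.diag, and scales rows and columns with a sparse diagonal matrix.

```python
import numpy as np
import scipy.sparse as sp


def cosine_sim_sparse(A):
    """Pairwise cosine similarity between the rows of a SciPy sparse matrix.

    Hypothetical helper: a sparse-aware variant of the question's cosine_sim.
    """
    A = sp.csr_matrix(A)

    # Sparse Gram matrix: entry (i, j) is the dot product of rows i and j.
    similarity = A @ A.T

    # Squared row magnitudes, read with the sparse .diagonal() method
    # rather than np.diag, which cannot handle a sparse input.
    square_mag = similarity.diagonal()

    # Inverse magnitudes; all-zero rows get 0 instead of inf.
    inv_mag = np.zeros_like(square_mag, dtype=float)
    nonzero = square_mag > 0
    inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])

    # Scale rows and columns by the inverse magnitudes.
    scale = sp.diags(inv_mag)
    return scale @ similarity @ scale


# Small stand-in for the real matrix (assumed data, for illustration only).
rng = np.random.default_rng(0)
dense = rng.random((50, 40)) * (rng.random((50, 40)) < 0.3)
dense[0] = 0.0  # include an all-zero row to exercise the zero handling
demo = sp.csr_matrix(dense)

# What np.diag works with after its asanyarray call: a 0-d object array.
wrapped = np.asanyarray(demo @ demo.T)
print(wrapped.shape, wrapped.dtype)

cos_sim = cosine_sim_sparse(demo)
print(cos_sim.shape)
```

Note that the result has one row and one column per input row. With 1091384 stored elements in a 29805x40 matrix, the input is already mostly non-zero, so the 29805x29805 output is likely to be close to dense (roughly 7 GB as float64). If that is too large, computing the similarities in row blocks may be a better fit than materializing the full matrix.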