
Cosine distance between sparse matrices


I'm trying to understand how to use the csr_matrix API along with its cosine functionality, and I'm running into dimension mismatch issues.

I have the following two (3,3) matrices:

import numpy as np
import scipy.sparse

a = scipy.sparse.csr_matrix(np.reshape(np.arange(9), (3,3)))
b = scipy.sparse.csr_matrix(np.reshape(np.arange(9)*2+5, (3,3)))

And I want to compute the cosine similarity (or cosine distance) between a[0] and b[0], à la cosine(a[0], b[0]).

If I print out the dimensions of a[0], b[0] I get:

(<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>)

So their dimensions match. But calling cosine(a[0], b[0]) raises ValueError: dimension mismatch. Any ideas why?


Solution

  • The problem is that numpy.dot() is not aware of sparse matrices, per the SciPy docs: http://docs.scipy.org/doc/scipy/reference/sparse.html

    When I run

    >>> scipy.spatial.distance.cosine(a[0], b[0])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.6/site-packages/scipy/spatial/distance.py", line 303, in cosine
        return (1.0 - (np.dot(u, v.T) / \
      File "/usr/lib64/python2.6/site-packages/scipy/sparse/base.py", line 287, in __mul__
        raise ValueError('dimension mismatch')
    ValueError: dimension mismatch
    

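    The error comes from np.dot(), which doesn't understand the csr_matrix objects passed to it. As the traceback shows, the call falls through to the sparse matrix's own __mul__, which raises the ValueError. This can be fixed by converting both rows to dense arrays first: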

    >>> scipy.spatial.distance.cosine(a[0].toarray(), b[0].toarray())
    array([[ 0.10197349]])
    

    This is probably not the answer you were looking for, since converting to a dense array loses the performance advantages of the sparse format, but at least it explains what is causing your problem.
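
If converting to dense is too costly, one option is to compute the same quantity with sparse operations only. The following is a minimal sketch, assuming both inputs are 1xN sparse rows such as a[0] and b[0] and that neither row is all zeros; sparse_cosine_distance is a hypothetical helper name, not a SciPy function.

```python
import numpy as np
import scipy.sparse as sp


def sparse_cosine_distance(u, v):
    """Cosine distance between two 1xN sparse rows, without densifying.

    Hypothetical helper (not part of SciPy). Assumes neither row is all zeros.
    """
    # Element-wise product stays sparse; summing it gives the dot product.
    dot = u.multiply(v).sum()
    # Euclidean norms, computed the same way from each row with itself.
    norm_u = np.sqrt(u.multiply(u).sum())
    norm_v = np.sqrt(v.multiply(v).sum())
    return 1.0 - dot / (norm_u * norm_v)


a = sp.csr_matrix(np.reshape(np.arange(9), (3, 3)))
b = sp.csr_matrix(np.reshape(np.arange(9) * 2 + 5, (3, 3)))

distance = sparse_cosine_distance(a[0], b[0])
print(distance)  # approximately 0.10197349, the same value as the dense call
```

If your data can contain empty rows, the norm is 0 there and the division is undefined, so you would want to guard against that case before dividing.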