python, numpy, scipy, scikit-learn, cosine-similarity

How to compute the cosine similarity of a list of scipy.sparse.csr.csr_matrix


I have a list of sparse vectors:

print(type(downsample_matrix)) # Display <class 'list'>
print(type(downsample_matrix[0])) # Display <class 'scipy.sparse.csr.csr_matrix'>

I would like to use the scikit-learn function cosine_similarity on downsample_matrix, but I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-27-5997ca6abb2d> in <module>()
     19         downsample_matrix.append(vector)
     20         downsample_coefficient = 0
---> 21 similarity_matrix = cosine_similarity(downsample_matrix)
     22 plt.matshow(similarity_matrix)
     23 plt.show()

/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in cosine_similarity(X, Y, dense_output)
    908     # to avoid recursive import
    909 
--> 910     X, Y = check_pairwise_arrays(X, Y)
    911 
    912     X_normalized = normalize(X, copy=True)

/home/venv/lib/python3.5/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
    104     if Y is X or Y is None:
    105         X = Y = check_array(X, accept_sparse='csr', dtype=dtype,
--> 106                             warn_on_dtype=warn_on_dtype, estimator=estimator)
    107     else:
    108         X = check_array(X, accept_sparse='csr', dtype=dtype,

/home/venv/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: setting an array element with a sequence.

I have no problem when my list is made of numpy.ndarray objects:

print(type(downsample_matrix)) # Display <class 'list'>
print(type(downsample_matrix[0])) # Display <class 'numpy.ndarray'>

How can I apply cosine_similarity to my list of sparse vectors?


Solution

  • Create a small sparse matrix. Note that it is not a subclass of ndarray. It stores its data in 3 arrays: data, indices, and indptr:

    In [196]: M = sparse.csr_matrix([[0,1,0],[1,0,1]])
    In [197]: M
    Out[197]: 
    <2x3 sparse matrix of type '<class 'numpy.int32'>'
        with 3 stored elements in Compressed Sparse Row format>
    In [198]: M.data
    Out[198]: array([1, 1, 1], dtype=int32)
    In [199]: M.indices
    Out[199]: array([1, 0, 2], dtype=int32)
    In [200]: M.indptr
    Out[200]: array([0, 1, 3], dtype=int32)
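
To make the role of the three arrays concrete, here is a minimal sketch that rebuilds the dense matrix from data, indices and indptr. The helper name csr_to_dense is hypothetical (it is not part of SciPy), and it assumes the matrix has no duplicate entries.

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix([[0, 1, 0], [1, 0, 1]])

def csr_to_dense(data, indices, indptr, shape):
    """Rebuild a dense array from the three CSR arrays (illustrative helper)."""
    dense = np.zeros(shape, dtype=data.dtype)
    for row in range(shape[0]):
        # indptr[row]:indptr[row + 1] is the slice of data/indices for this row
        start, stop = indptr[row], indptr[row + 1]
        # indices[start:stop] holds the column positions of this row's values
        dense[row, indices[start:stop]] = data[start:stop]
    return dense

rebuilt = csr_to_dense(M.data, M.indices, M.indptr, M.shape)
print(rebuilt)
```

Because the values live in these side arrays rather than in a regular n-d buffer, NumPy has no way to treat the matrix as a block of numbers.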
    

    If I try to make an array from a list that holds this matrix three times, I get an object dtype array with 3 elements (each a pointer to the same matrix):

    In [201]: alist = [M,M,M]
    In [202]: np.array(alist)
    Out[202]: /usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:294: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead.
      "using <, >, or !=, instead.", SparseEfficiencyWarning)
    
    array([ <2x3 sparse matrix of type '<class 'numpy.int32'>'
        with 3 stored elements in Compressed Sparse Row format>,
           <2x3 sparse matrix of type '<class 'numpy.int32'>'
        with 3 stored elements in Compressed Sparse Row format>,
           <2x3 sparse matrix of type '<class 'numpy.int32'>'
        with 3 stored elements in Compressed Sparse Row format>], dtype=object)
    

    If, in addition, I specify a numeric dtype, I get your error:

    In [203]: np.array(alist,dtype=int)
    ...
    ValueError: setting an array element with a sequence.
    

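    NumPy can't convert the list of sparse matrices into an array of numbers. This is the same np.array(array, dtype=dtype, ...) call that check_array makes at the bottom of your traceback.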

    But if it's a list of dense arrays, I get a 3d array:

    In [204]: np.array([M.A,M.A,M.A],dtype=int)
    Out[204]: 
    array([[[0, 1, 0],
            [1, 0, 1]],
    
           [[0, 1, 0],
            [1, 0, 1]],
    
           [[0, 1, 0],
            [1, 0, 1]]])
    In [205]: _.shape
    Out[205]: (3, 2, 3)
    

    I can also concatenate the sparse matrices with a sparse version of vstack or hstack.

    In [206]: sparse.vstack(alist)
    Out[206]: 
    <6x3 sparse matrix of type '<class 'numpy.int32'>'
        with 9 stored elements in Compressed Sparse Row format>
    In [207]: _.A
    Out[207]: 
    array([[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0],
           [1, 0, 1],
           [0, 1, 0],
           [1, 0, 1]], dtype=int32)
    

    Note the shape, (6,3). A sparse matrix is always 2d.

    sparse.vstack passes the task to sparse.bmat, which constructs a new sparse matrix from 'blocks'. It does so by joining the coo representations of the blocks with appropriate offsets.
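
The offset idea can be sketched by hand as follows. This is only an illustration of the principle, not SciPy's actual implementation, and the function name vstack_via_coo is made up for this example.

```python
import numpy as np
from scipy import sparse

def vstack_via_coo(blocks):
    """Stack sparse blocks vertically by offsetting COO row indices (sketch)."""
    rows, cols, vals = [], [], []
    row_offset = 0
    n_cols = blocks[0].shape[1]
    for block in blocks:
        coo = block.tocoo()
        if coo.shape[1] != n_cols:
            raise ValueError("all blocks must have the same number of columns")
        # Shift this block's row indices down by the rows already placed
        rows.append(coo.row + row_offset)
        cols.append(coo.col)
        vals.append(coo.data)
        row_offset += coo.shape[0]
    stacked = sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(row_offset, n_cols),
    )
    return stacked.tocsr()

M = sparse.csr_matrix([[0, 1, 0], [1, 0, 1]])
manual = vstack_via_coo([M, M, M])
reference = sparse.vstack([M, M, M])
print(manual.shape)
```

Only the stored (nonzero) entries are touched, so the blocks are never converted to dense arrays.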

    Since cosine_similarity expects a 2d array or sparse matrix, you'll have to use sparse.vstack to join the matrices, or reshape the result of the 3d array join:

    In [212]: cosine_similarity(sparse.vstack(alist))
    Out[212]: 
    array([[ 1.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  1.],
           [ 1.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  1.],
           [ 1.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  1.]])
    In [213]: cosine_similarity( np.array([M.A,M.A,M.A],dtype=int).reshape(-1,3))
    Out[213]: 
    array([[ 1.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  1.],
           [ 1.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  1.],
           [ 1.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  1.]])
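
Applied to the question, the same approach might look like the sketch below. The three vectors are made-up stand-ins for the real downsample_matrix, and it assumes each list element is a 1 x n_features CSR row vector with the same number of columns.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the question's downsample_matrix:
# a list of 1 x n_features CSR row vectors.
downsample_matrix = [
    sparse.csr_matrix([[1, 0, 2, 0]]),
    sparse.csr_matrix([[0, 3, 0, 0]]),
    sparse.csr_matrix([[2, 0, 4, 0]]),
]

# Join the row vectors into one (n_vectors x n_features) sparse matrix
stacked = sparse.vstack(downsample_matrix, format="csr")

# One row and one column per original vector
similarity_matrix = cosine_similarity(stacked)
print(similarity_matrix.shape)
```

The result is a dense n_vectors x n_vectors array, so it can be passed to plt.matshow as in the original code.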