Search code examples
pythonscipysparse-matrix

Most efficient way of accessing non-zero values in row/column in scipy.sparse matrix


What is the fastest or, failing that, least wordy way of accessing all non-zero values in a row row or column col of a scipy.sparse matrix A in CSR format?

Would doing it in another format (say, COO) be more efficient?

Right now, I use the following:

A[row, A[row, :].nonzero()[1]]

or

A[A[:, col].nonzero()[0], col]

Solution

  • For a problem like this is pays to understand the underlying data structures for the different formats:

    In [672]: A=sparse.csr_matrix(np.arange(24).reshape(4,6))
    In [673]: A.data
    Out[673]: 
    array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
           18, 19, 20, 21, 22, 23], dtype=int32)
    In [674]: A.indices
    Out[674]: array([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
    In [675]: A.indptr
    Out[675]: array([ 0,  5, 11, 17, 23], dtype=int32)
    

    The data values for a row are a slice within A.data, but identifying that slice requires some knowledge of the A.indptr (see below)

    For the coo.

    In [676]: Ac=A.tocoo()
    In [677]: Ac.data
    Out[677]: 
    array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
           18, 19, 20, 21, 22, 23], dtype=int32)
    In [678]: Ac.row
    Out[678]: array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3], dtype=int32)
    In [679]: Ac.col
    Out[679]: array([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], dtype=int32)
    

    Note that A.nonzeros() converts to coo and returns the row and col attributes (more or less - look at its code).

    For the lil format; data is stored by row in lists:

    In [680]: Al=A.tolil()
    In [681]: Al.data
    Out[681]: 
    array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17],
           [18, 19, 20, 21, 22, 23]], dtype=object)
    In [682]: Al.rows
    Out[682]: 
    array([[1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5],
           [0, 1, 2, 3, 4, 5]], dtype=object)
    

    ===============

    Selecting a row of A works, though in my experience that tends to be a bit slow, in part because it has to create a new csr matrix. Also your expression seems wordier than needed.

    Looking at my first row which has a 0 element (the others are too dense):

    In [691]: A[0, A[0,:].nonzero()[1]].A
    Out[691]: array([[1, 2, 3, 4, 5]], dtype=int32)
    

    The whole row, expressed as a dense array is:

    In [692]: A[0,:].A
    Out[692]: array([[0, 1, 2, 3, 4, 5]], dtype=int32)
    

    but the data attribute of that row is the same as your selection

    In [693]: A[0,:].data
    Out[693]: array([1, 2, 3, 4, 5], dtype=int32)
    

    and with the lil format

    In [694]: Al.data[0]
    Out[694]: [1, 2, 3, 4, 5]
    

    A[0,:].tocoo() doesn't add anything.

    Direct access to attributes of a csr and lil isn't that good when picking columns. For that csc is better, or lil of the transpose.

    Direct access to the csr data, with the aid of indptr, would be:

    In [697]: i=0; A.data[A.indptr[i]:A.indptr[i+1]]
    Out[697]: array([1, 2, 3, 4, 5], dtype=int32)
    

    Calculations using the csr format routinely iterate through indptr like this, getting the values of each row - but they do this in compiled code.

    A recent related topic, seeking the product of nonzero elements by row: Multiplying column elements of sparse Matrix

    There I found the reduceat using indptr was quite fast.

    Another tool when dealing with sparse matrices is multiplication

    In [708]: (sparse.csr_matrix(np.array([1,0,0,0])[None,:])*A)
    Out[708]: 
    <1x6 sparse matrix of type '<class 'numpy.int32'>'
        with 5 stored elements in Compressed Sparse Row format>
    

    csr actually does sum with this kind of multiplication. And if my memory is correct, it actually performs A[0,:] this way

    Sparse matrix slicing using list of int