Tags: python, numpy, scipy, sparse-matrix

How should this mixed scipy.sparse / numpy program be handled?


I am currently trying to use numpy as well as scipy to handle sparse matrices. While evaluating the sparsity of a matrix I ran into trouble, and I don't know how the following behaviour should be understood:

import numpy as np
import scipy.sparse as sp

a = sp.csc_matrix(np.ones((3,3)))
a
np.count_nonzero(a)

When evaluating a and its nonzero count with the above code, I saw this output in IPython:

Out[9]: <3x3 sparse matrix of type '<class 'numpy.float64'>' with 9 stored elements in Compressed Sparse Column format>

Out[10]: 1

I think there is something I don't understand here. A 3x3 matrix full of ones should have 9 nonzero terms, and this is the answer I get if I first convert it with the toarray method from scipy. Am I using numpy and scipy the wrong way?


Solution

  • The nonzero count is available as an attribute:

    In [295]: a=sparse.csr_matrix(np.arange(9).reshape(3,3))
    In [296]: a
    Out[296]: 
    <3x3 sparse matrix of type '<class 'numpy.int32'>'
        with 8 stored elements in Compressed Sparse Row format>
    In [297]: a.nnz
    Out[297]: 8
    

    As Warren commented, you can't count on numpy functions working on sparse matrices. Use the sparse functions and methods instead. Sometimes a numpy function is written so that it invokes the array's own method, in which case the function call might work on a sparse matrix. But that is true only on a case-by-case basis.

    In IPython I make heavy use of a.<tab> to get a list of completions (attributes and methods). I also use function?? to look at the code.

    In the case of np.count_nonzero I see no code - it is compiled, and only works on np.ndarray objects.

    np.nonzero(a) works. Look at its code, and see that it looks for the array's method: nonzero = a.nonzero

    The sparse nonzero method code is:

    def nonzero(self):
        ...
        # convert to COOrdinate format
        A = self.tocoo()
        nz_mask = A.data != 0
        return (A.row[nz_mask],A.col[nz_mask])
    

    The A.data != 0 line is there because it is possible to construct a matrix that stores explicit 0 values, particularly with the coo (data, (i, j)) input format. Apart from that caveat, the nnz attribute gives a reliable count.

    Doing a.<tab> I also see the a.getnnz and a.eliminate_zeros methods, which may be helpful if you are worried about sneaky zeros.
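
The "sneaky zeros" case can be reproduced with a small sketch. The matrix below is a made-up example, not taken from the question, and it assumes a SciPy version that provides the sparse count_nonzero() method. It stores one explicit zero through the (data, (i, j)) input format and compares the stored-element count with the true nonzero count:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 3x3 matrix built from (data, (row, col)); the middle entry is an
# explicitly stored zero.
data = np.array([1.0, 0.0, 2.0])
row = np.array([0, 1, 2])
col = np.array([0, 1, 2])
a = sp.coo_matrix((data, (row, col)), shape=(3, 3)).tocsr()

stored = a.nnz                    # stored entries, explicit zero included
true_nonzero = a.count_nonzero()  # only entries whose value is not zero

a.eliminate_zeros()               # drops explicitly stored zeros in place
stored_after = a.nnz

print(stored, true_nonzero, stored_after)
```

Here nnz reports 3 before the cleanup and 2 after it, while count_nonzero() reports 2 throughout.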

    Sometimes it is useful to work directly with the data attributes of a sparse matrix. It's safer to access them than to modify them. But each sparse format has different attributes. In the csr case you can do:

    In [306]: np.count_nonzero(a.data)
    Out[306]: 8
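
Applied to the all-ones matrix from the question, the approaches discussed in this answer can be put side by side. This is only a sketch: it uses sp.csc_matrix from the top-level scipy.sparse namespace and assumes the sparse count_nonzero() method is available in the installed SciPy version.

```python
import numpy as np
import scipy.sparse as sp

a = sp.csc_matrix(np.ones((3, 3)))

n_stored = a.nnz                         # number of stored elements
n_method = a.count_nonzero()             # sparse method, ignores explicit zeros
n_data = np.count_nonzero(a.data)        # numpy applied to the underlying data array
n_dense = np.count_nonzero(a.toarray())  # numpy applied to a dense copy
rows, cols = a.nonzero()                 # index arrays of the nonzero entries

print(n_stored, n_method, n_data, n_dense, len(rows))
```

All five values agree at 9 for this matrix. The toarray() route allocates a dense copy, so it is only reasonable for small matrices.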