Search code examples
scipylinear-algebrasparse-matrix

Scipy: Sparse Matrix to ndarray


I have a matrix A in CSC-format, of which I index just a single column

b = A[:,col]

resulting in a (n x 1) matrix. What I want to do is:

v  = M * b

where M is a (n x n) matrix in CSR. The result v is a (n x 1) CSR-matrix. I need to iterate the values in v (not including the 0s actually) and retrieve the index of one element meeting a special criteria (note: sparse matrix formats were not chosen to fit that particular operation, but general matrix x matrix-products should be fastest with CSR * CSC, right?)

The problem is, that iterating the entries in the CSR-formatted vector (0 < i < n: v[i,0]) is terribly slow and I actually waste quite some memory since v is not sparse anymore.

Could anyone tell me how to perform these operations as such, that I can quickly iterate over the result vector, keeping the copy-related memory overhead small?

IN: M (CSR-Matrix), A (CSC-Matrix), col_index
v = M * A[:,col_index]
for entries in v:
    do stuff

Is it also possible to somehow speed up "advanced" indexing over columns in a CSC-Matrix? At some other point in the code, I have to extract a submatrix of A (cannot be reformulated to allow for slicing, therefore using an index array), that includes a given subset of all columns. A[:,idxlist] takes quite long when line-profiling.

Looking forward to your suggestions


Solution

  • The scipy sparse module is getting better every release, but it is quite obviously work in progress, so there is a lot of optimization you can do by accessing the innards of the objects directly. E.g. your case:

    >>> a = sps.rand(5, 20, density=0.2, format='csr')
    >>> b = sps.rand(20, 1, density=0.2, format='csc')
    >>> c = a * b
    >>> c.A
    array([[ 0.30331594],
           [ 0.        ],
           [ 0.12198742],
           [ 0.34350077],
           [ 0.        ]])
    

    You can get the non-zero entries of c as c.data:

    >>> c.data
    array([ 0.30331594,  0.12198742,  0.34350077])
    

    Getting the corresponding row number is a little trickier. Probably the easiest would be to convert your output to CSC format, since them you have them directly as c.indices, and c.data will still be the same as before:

    >>> c.tocsc().indices
    array([0, 2, 3])
    >>> c.tocsc().data
    array([ 0.30331594,  0.12198742,  0.34350077])
    

    But you can extract them without doing the conversion if you don't fancy it:

    >>> np.where(c.indptr[:-1] != c.indptr[1:])[0]
    array([0, 2, 3], dtype=int64)
    

    So if you wanted to find, e.g. the largest value and its row number, you could do:

    >>> row_idx = np.where(c.indptr[:-1] != c.indptr[1:])[0]
    >>> idx = np.argmax(c.data)
    >>> c.data[idx], row_idx[idx]
    (0.34350077450601624, 3)