Search code examples
pythonindexingscipysparse-matrixslice

Slicing Sparse Matrices in Scipy -- Which Types Work Best?


The SciPy Sparse Matrix tutorial is very good -- but it actually leaves the section on slicing un(der)developed (still in outline form -- see section: "Handling Sparse Matrices").

I will try and update the tutorial, once this question is answered.

I have a large sparse matrix -- currently in dok_matrix format.

import numpy as np
from scipy import sparse
M = sparse.dok_matrix((10**6, 10**6))

For various methods I want to be able to slice columns and for others I want to slice rows. Ideally I would use advanced-indexing (i.e. a boolean vector, bool_vect) with which to slice a sparse matrix M -- as in:

bool_vect = np.arange(10**6)%2  # every even index
out = M[bool_vect,:]            # Want to select every even row

or

out = M[:,bool_vect] # Want to select every even column

First off, dok_matrices do not support this -- but I think it works (slowly) if I first cast to lil_matrices, via sparse.lil_matrix(M)

As far as I can gather from the tutorial -- to slice columns I want to use CSC and to slice rows I want to slice CSR. So does that mean I should cast the matrix M via:

M.tocsc()[:,bool_vect]

or

M.tocsr()[bool_vect,:]

I am kinda guessing here and my code is slow because of it. Any help from someone who understands how this works would be appreciated. Thanks in advance.

If it turns out I should not be indexing my matrix with a boolean array, but rather a list of integers (indices) -- that is also something I would be happy to find out. Whichever is more efficient.

Finally -- this is a big matrix, so bonus points if this can happen in place / with broadcasting.


Solution

  • Ok, so I'm pretty sure the "right" way to do this is: if you are slicing columns, use tocsc() and slice using a list/array of integers. Boolean vectors does not seem to do the trick with sparse matrices -- the way it does with ndarrays in numpy. Which means the answer is.

    indices = np.where(bool_vect)[0]
    out1 = M.tocsc()[:,indices]
    out2 = M.tocsr()[indices,:]
    

    But question: is this the best way? Is this in place?

    In practice this does seem to be happening in place -- and it is much faster than prior attempts (using lil_matrix).