I have a matrix A in CSC-format, of which I index just a single column
b = A[:,col]
resulting in a (n x 1) matrix. What I want to do is:
v = M * b
where M is a (n x n) matrix in CSR. The result v is a (n x 1) CSR-matrix. I need to iterate the values in v (not including the 0s actually) and retrieve the index of one element meeting a special criteria (note: sparse matrix formats were not chosen to fit that particular operation, but general matrix x matrix-products should be fastest with CSR * CSC, right?)
The problem is, that iterating the entries in the CSR-formatted vector (0 < i < n: v[i,0]) is terribly slow and I actually waste quite some memory since v is not sparse anymore.
Could anyone tell me how to perform these operations as such, that I can quickly iterate over the result vector, keeping the copy-related memory overhead small?
IN: M (CSR-Matrix), A (CSC-Matrix), col_index
v = M * A[:,col_index]
for entries in v:
do stuff
Is it also possible to somehow speed up "advanced" indexing over columns in a CSC-Matrix? At some other point in the code, I have to extract a submatrix of A (cannot be reformulated to allow for slicing, therefore using an index array), that includes a given subset of all columns. A[:,idxlist] takes quite long when line-profiling.
Looking forward to your suggestions
The scipy sparse module is getting better every release, but it is quite obviously work in progress, so there is a lot of optimization you can do by accessing the innards of the objects directly. E.g. your case:
>>> a = sps.rand(5, 20, density=0.2, format='csr')
>>> b = sps.rand(20, 1, density=0.2, format='csc')
>>> c = a * b
>>> c.A
array([[ 0.30331594],
[ 0. ],
[ 0.12198742],
[ 0.34350077],
[ 0. ]])
You can get the non-zero entries of c
as c.data
:
>>> c.data
array([ 0.30331594, 0.12198742, 0.34350077])
Getting the corresponding row number is a little trickier. Probably the easiest would be to convert your output to CSC format, since them you have them directly as c.indices
, and c.data
will still be the same as before:
>>> c.tocsc().indices
array([0, 2, 3])
>>> c.tocsc().data
array([ 0.30331594, 0.12198742, 0.34350077])
But you can extract them without doing the conversion if you don't fancy it:
>>> np.where(c.indptr[:-1] != c.indptr[1:])[0]
array([0, 2, 3], dtype=int64)
So if you wanted to find, e.g. the largest value and its row number, you could do:
>>> row_idx = np.where(c.indptr[:-1] != c.indptr[1:])[0]
>>> idx = np.argmax(c.data)
>>> c.data[idx], row_idx[idx]
(0.34350077450601624, 3)