Tags: python, scipy, sparse-matrix

Why does the CSR format outperform the CSC format when selecting rows?


I am doing a little experiment on the CSR and CSC formats in scipy.sparse. As the documentation claims, the CSC format is more efficient than CSR when dealing with column operations. So I ran this small experiment:

from scipy.sparse import diags
d = diags([-1, 1, 1], [-1, 0, 1], shape=(100, 100))  # offsets -1, 0, +1; a repeated offset would make diags raise ValueError
%timeit d.tocsc()[:, 1]
%timeit d.tocsr()[:, 1]

Then I supposed I had the rows and columns the wrong way around, so I tried the opposite:

%timeit d.tocsc()[1]
%timeit d.tocsr()[1]

But in both cases CSR outperforms CSC. I was wondering whether there is a structural reason for these results, or whether my tests are simply biased.


Solution

  • Indexing of sparse matrices is complicated, with a number of variables that can affect the times. Your test case is symmetric, so the csr and csc formats behave nearly the same. Things could differ if the shape were rectangular or the layout different.

    Indexing with a scalar versus a slice versus a list may also behave differently.

    Keep in mind that sparse indexing, regardless of format, is much slower than indexing a numpy.ndarray. Try not to use sparse if you need to iterate. (And if you must iterate, consider working with the 'raw' sparse attributes directly.)
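For example, a minimal sketch of reading one row straight from the raw CSR attributes (`indptr`, `indices`, `data`), which skips the overhead of constructing a new sparse matrix for each row (using the symmetric test matrix from below):

```python
from scipy import sparse

d = sparse.diags([-1, 1, 1], [-1, 0, 1], shape=(100, 100))
x = d.tocsr()

i = 1
# Row i of a CSR matrix lives in the contiguous slice indptr[i]:indptr[i+1]
start, stop = x.indptr[i], x.indptr[i + 1]
cols = x.indices[start:stop]   # column indices of row i's nonzeros
vals = x.data[start:stop]      # the nonzero values themselves
```

This returns plain ndarray slices, so iterating a loop over `range(x.shape[0])` this way avoids the per-access sparse machinery entirely.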

    In [64]: d = sparse.diags([-1, 1, 1], [-1, 0, 1], shape= (100, 100))                 
    In [65]: d                                                                           
    Out[65]: 
    <100x100 sparse matrix of type '<class 'numpy.float64'>'
        with 298 stored elements (3 diagonals) in DIAgonal format>
    
    In [66]: %timeit d.tocsr()[:,1]                                                      
    422 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    In [67]: %timeit d.tocsc()[:,1]                                                      
    506 µs ± 5.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    In [68]: %timeit d.tocsc()[1,:]                                                      
    506 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Separate the conversion from the indexing:

    In [69]: %%timeit x=d.tocsr() 
        ...: x[:,1]                                                                              
    121 µs ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [70]: %%timeit x=d.tocsr() 
        ...: x[1,:]                                                                           
    118 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [71]: %%timeit x=d.tocsc() 
        ...: x[:,1]                                                                            
    290 µs ± 5.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [72]: %%timeit x=d.tocsc() 
        ...: x[1,:]                                                                            
    297 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    I'm a little surprised that the csr times are consistently faster, but only a little. csc and csr share a common compressed base class (sparse.compressed._cs_matrix), but the coding details seem to favor csr.

    ===

    And just for the fun of it, index a lil format

    In [73]: %%timeit x=d.tolil() 
        ...: x[1,:]                                                                            
    76.4 µs ± 231 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [74]: %%timeit x=d.tolil() 
        ...: x[:,1]                                                                       
    118 µs ± 647 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    As expected from the lil storage scheme, row indexing is faster, though the column indexing time isn't bad.

    lil has a getrow method, which is the closest thing in sparse to a numpy view:

    In [75]: %%timeit x=d.tolil() 
        ...: x.getrow(1)                                                                          
    36.7 µs ± 233 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
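The reason is visible in lil's internals: each row is stored as its own pair of Python lists, so fetching a row is one list lookup, while a column must visit every row's list. A quick look (same matrix assumed):

```python
from scipy import sparse

d = sparse.diags([-1, 1, 1], [-1, 0, 1], shape=(100, 100))
x = d.tolil()

# x.rows[i] holds the column indices of row i's nonzeros;
# x.data[i] holds the corresponding values, as plain Python lists.
print(x.rows[1], x.data[1])
```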
    

    ===

    Dense array indexing, even including the conversion time, is faster than the conversion-plus-indexing sparse cases above:

    In [83]: %%timeit x=d.A 
        ...: x[:,1]                                                                          
    277 ns ± 9.97 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    In [84]: timeit d.A[:,1]                                                             
    197 µs ± 587 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
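So one rule of thumb, assuming the matrix fits in memory: if many rows or columns will be touched, convert once to a dense array and index that. A sanity check that the dense and sparse results agree:

```python
import numpy as np
from scipy import sparse

d = sparse.diags([-1, 1, 1], [-1, 0, 1], shape=(100, 100))

a = d.toarray()                  # convert once (equivalent to d.A)
dense_col = a[:, 1]              # plain ndarray indexing thereafter
sparse_col = d.tocsr()[:, 1].toarray().ravel()

assert np.array_equal(dense_col, sparse_col)
```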