
Efficiently compute columnwise sum of sparse array where every non-zero element is 1


I have a bunch of data in SciPy compressed sparse row (CSR) format. Of course the majority of elements are zero, and I further know that all non-zero elements have a value of 1. I want to compute sums over different subsets of rows of my matrix. At the moment I am doing the following:

import numpy as np
import scipy as sp
import scipy.sparse

# create some data with sparsely distributed ones
data = np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05))
data = sp.sparse.csr_matrix(data, dtype='int8')

# generate column-wise sums over random subsets of rows
nrand = 1000
for k in range(nrand):
    inds = np.random.choice(data.shape[0], size=100, replace=False)

    # 60% of time is spent here
    extracted_rows = data[inds]

    # 20% of time is spent here
    row_sum = extracted_rows.sum(axis=0)

The last few lines there are the bottleneck in a larger computational pipeline. As I annotated in the code, 60% of time is spent slicing the data from the random indices, and 20% is spent computing the actual sum.

It seems to me I should be able to use my knowledge about the data in the array (i.e., any non-zero value in the sparse matrix will be 1; no other values present) to compute these sums more efficiently. Unfortunately, I cannot figure out how. Dealing with just data.indices perhaps? I have tried other sparsity structures (e.g. CSC matrix), as well as converting to dense array first, but these approaches were all slower than this CSR matrix approach.
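By "dealing with just data.indices" I mean something like the following sketch (I haven't benchmarked it): because every stored value is 1, the column-wise sum of a row subset is just a count of the column indices stored for those rows, which np.bincount can compute directly from the CSR arrays.

```python
import numpy as np
import scipy.sparse

# same sparsely distributed ones as above
data = scipy.sparse.csr_matrix(
    np.random.choice((0, 1), size=(1000, 2000), p=(0.95, 0.05)), dtype='int8')
inds = np.random.choice(data.shape[0], size=100, replace=False)

# gather the stored column indices of the selected rows straight from
# CSR storage; since every stored value is 1, summing a column is the
# same as counting how often its index appears
cols = np.concatenate([data.indices[data.indptr[i]:data.indptr[i + 1]]
                       for i in inds])
row_sum = np.bincount(cols, minlength=data.shape[1])

# matches the slice-and-sum result
assert np.array_equal(row_sum, np.asarray(data[inds].sum(axis=0)).ravel())
```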


Solution

  • It is well known that indexing of sparse matrices is relatively slow, and there have been SO questions about getting around that by accessing the data attributes directly.

    But first, some timings. Using data and inds as you show, I get

    In [23]: datad=data.A  # dense copy; the conversion alone takes 3.76 ms
    
    In [24]: timeit row_sumd=datad[inds].sum(axis=0)
    1000 loops, best of 3: 529 µs per loop
    
    In [25]: timeit row_sum=data[inds].sum(axis=0)
    1000 loops, best of 3: 890 µs per loop
    
    In [26]: timeit d=datad[inds]
    10000 loops, best of 3: 55.9 µs per loop
    
    In [27]: timeit d=data[inds]
    1000 loops, best of 3: 617 µs per loop
    

    The sparse version is slower than the dense one, but not by a lot. The sparse indexing is much slower, but its sum is somewhat faster.

    The sparse sum is done with a matrix product

    def sum(self, axis=None):   # scipy.sparse.spmatrix.sum (excerpt)
        ....
        return np.asmatrix(np.ones((1, m), dtype=res_dtype)) * self
    

    That suggests a faster way: turn inds into an appropriate array of 1s and multiply.

    In [49]: %%timeit
       ....: b=np.zeros((1,data.shape[0]),'int8')
       ....: b[:,inds]=1
       ....: rowmul=b*data
       ....: 
    1000 loops, best of 3: 587 µs per loop
    

    That makes the sparse operation about as fast as the equivalent dense one (and it avoids the much slower conversion to a dense array first).
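    As a quick correctness check, a sketch with freshly generated data of the same shape and density (the question's exact data isn't reproducible here): the indicator-vector product gives the same column sums as slicing and summing.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
data = sp.csr_matrix((rng.random((1000, 2000)) < 0.05).astype('int8'))
inds = rng.choice(data.shape[0], size=100, replace=False)

# baseline: fancy-index the rows, then sum them
row_sum = np.asarray(data[inds].sum(axis=0)).ravel()

# indicator-vector trick: one sparse vector-matrix product
b = np.zeros((1, data.shape[0]), dtype='int8')
b[:, inds] = 1
rowmul = np.asarray(b * data).ravel()

assert np.array_equal(row_sum, rowmul.astype(row_sum.dtype))
```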

    ==================

    The last timing test is missing the np.asmatrix that is present in the sparse sum. But the times are similar, and the results are the same:

    In [232]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x1=np.asmatrix(b)*data
    1000 loops, best of 3: 661 µs per loop
    
    In [233]: timeit b=np.zeros((1,data.shape[0]),'int8'); b[:,inds]=1; x2=b*data
    1000 loops, best of 3: 605 µs per loop
    

    One produces a matrix, the other an array. But both are doing a matrix product, the 2nd dimension of b against the 1st of data. Even though b is an array, the task is actually delegated to data and its matrix product, in a not-so-transparent way.

    In [234]: x1
    Out[234]: matrix([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
    
    In [235]: x2
    Out[235]: array([[9, 9, 5, ..., 9, 5, 3]], dtype=int8)
    

    b*data.A is elementwise multiplication and raises an error (the shapes (1, 1000) and (1000, 2000) don't broadcast); np.dot(b, data.A) works but is slower.
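    A minimal demonstration of that shape mismatch, with zero-filled stand-ins of the same shapes:

```python
import numpy as np

b = np.zeros((1, 1000), dtype='int8')
datad = np.zeros((1000, 2000), dtype='int8')  # stand-in for data.A

# elementwise multiply: (1, 1000) and (1000, 2000) don't broadcast
try:
    b * datad
    broadcast_ok = True
except ValueError:
    broadcast_ok = False
assert not broadcast_ok

# an explicit matrix product is fine
assert np.dot(b, datad).shape == (1, 2000)
```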

    Newer numpy/python has a matmul operator, @. I see the same time pattern:

    In [280]: timeit b@datad           # dense product
    100 loops, best of 3: 2.64 ms per loop
    
    In [281]: timeit b@data.A          # slower due to `.A` conversion
    100 loops, best of 3: 6.44 ms per loop
    
    In [282]: timeit b@data             # sparse product
    1000 loops, best of 3: 571 µs per loop
    

    np.dot may also delegate action to sparse, though you have to be careful. I just hung my machine with np.dot(csr_matrix(b),data.A).
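    To stay on the safe side, keep the sparse matrix itself as an operand so it drives the product. A sketch (both forms avoid materializing a dense copy of data):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
data = sp.csr_matrix((rng.random((1000, 2000)) < 0.05).astype('int8'))
b = np.zeros((1, data.shape[0]), dtype='int8')
b[:, rng.choice(data.shape[0], size=100, replace=False)] = 1

x1 = b @ data                 # dense 1-row result, sparse product under the hood
x2 = sp.csr_matrix(b) @ data  # fully sparse product, sparse result

assert np.array_equal(np.asarray(x1).ravel(), x2.toarray().ravel())
```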