Fast way to find indexes of nonzero entries for every row in a CSC matrix in Python

Here's the current implementation:

def nonzero_indexes_by_row(input):
    return [
        np.nonzero(row)[1] 
        for row in csr_matrix(input.T)
    ]

The matrix is very large(1.5M, 500K), since I'm accessing rows, I have to convert CSC to CSR first. The result would be a 2d list, each list contains a list of indexes that are nonzero corresponding to the row in the original matrix.

The current process takes 20 minutes. Is there a faster way?

Solution

Sure. You're pretty close to having an ideal solution, but you're allocating some unnecessary arrays. Here's a faster way:

from scipy import sparse
import numpy as np

def my_impl(csc):
    csr = csc.tocsr()
    return np.split(csr.indices, csr.indptr[1:-1])

def your_impl(input):
    return [
        np.nonzero(row)[1] 
        for row in sparse.csr_matrix(input)
    ]

## Results

# demo data
csc = sparse.random(15000, 5000, format="csc")

your_result = your_impl(csc)
my_result = my_impl(csc)

## Tests for correctness

# Same result
assert all(np.array_equal(x, y) for x, y in zip(your_result, my_result))
# Right number of rows
assert len(my_result) == csc.shape[0]

## Speed

%timeit my_impl(csc)
# 31 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit your_impl(csc)
# 1.49 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Side question, why are you transposing the matrix? Wouldn't you then be getting the non-zero entries of the columns? If that's what you want, you don't even need to convert to csr and can just run:

np.split(csc.indices, csc.indptr[1:-1])