Search code examples
pythonnumpysparse-matrix

Fast way to find indexes of nonzero entries for every row in a CSC matrix in Python


Here's the current implementation:

def nonzero_indexes_by_row(input):
    return [
        np.nonzero(row)[1] 
        for row in csr_matrix(input.T)
    ]

The matrix is very large(1.5M, 500K), since I'm accessing rows, I have to convert CSC to CSR first. The result would be a 2d list, each list contains a list of indexes that are nonzero corresponding to the row in the original matrix.

The current process takes 20 minutes. Is there a faster way?


Solution

  • Sure. You're pretty close to having an ideal solution, but you're allocating some unnecessary arrays. Here's a faster way:

    from scipy import sparse
    import numpy as np
    
    def my_impl(csc):
        csr = csc.tocsr()
        return np.split(csr.indices, csr.indptr[1:-1])
    
    def your_impl(input):
        return [
            np.nonzero(row)[1] 
            for row in sparse.csr_matrix(input)
        ]
    
    ## Results
    
    # demo data
    csc = sparse.random(15000, 5000, format="csc")
    
    your_result = your_impl(csc)
    my_result = my_impl(csc)
    
    ## Tests for correctness
    
    # Same result
    assert all(np.array_equal(x, y) for x, y in zip(your_result, my_result))
    # Right number of rows
    assert len(my_result) == csc.shape[0]
    
    ## Speed
    
    %timeit my_impl(csc)
    # 31 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit your_impl(csc)
    # 1.49 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Side question, why are you transposing the matrix? Wouldn't you then be getting the non-zero entries of the columns? If that's what you want, you don't even need to convert to csr and can just run:

    np.split(csc.indices, csc.indptr[1:-1])