Tags: python, scipy, runtime, sparse-matrix, memory-efficient

Scipy Sparse Matrix Loop Taking Forever - Need to make more efficient


What is the most efficient way, in both time and memory, to write this loop over a sparse matrix (currently using csc_matrix)?

for j in range(0, reducedsize):
    xs = sum(X[:, j])
    X[:, j] = X[:, j] / xs.data[0]

example:

reducedsize (int) - 2500
X (csc_matrix) - 908x2500

The loop does iterate but it takes a very long time compared to just using numpy.


Solution

  • If you want to do it without any memory overhead (in-place):

    Always think about how the data is actually stored. A small example on a csc matrix:

    import numpy as np
    from scipy import sparse
    
    shape=(5,5)
    X=sparse.random(shape[0], shape[1], density=0.5, format='csc')
    print(X.todense())
    
    [[0.12146814 0.         0.         0.04075121 0.28749552]
     [0.         0.92208639 0.         0.44279661 0.        ]
     [0.63509196 0.42334964 0.         0.         0.99160443]
     [0.         0.         0.25941113 0.44669367 0.00389409]
     [0.         0.         0.         0.         0.83226886]]
    
    i=0 #first column
    print(X.data[X.indptr[i]:X.indptr[i+1]])
    [0.12146814 0.63509196]
    

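For completeness, the row positions of those values live in X.indices, sliced the same way. A small sketch of my own (not part of the original answer) that reconstructs a dense column from the raw CSC arrays; a seed is used so the run is reproducible:

```python
import numpy as np
from scipy import sparse

X = sparse.random(5, 5, density=0.5, format='csc', random_state=0)

i = 0  # first column
vals = X.data[X.indptr[i]:X.indptr[i+1]]     # nonzero values of column i
rows = X.indices[X.indptr[i]:X.indptr[i+1]]  # their row indices

# rebuild the dense column from the (row, value) pairs
dense_col = np.zeros(X.shape[0])
dense_col[rows] = vals
assert np.allclose(dense_col, X.toarray()[:, i])
```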
    A Numpy solution

    So the only thing we want to do here is to modify the nonzero entries, column by column, in place. This can be done with a partly vectorized numpy solution: data is simply the array that contains all nonzero values, and indptr stores where each column begins and ends.

    def Numpy_csc_norm(data,indptr):
        for i in range(indptr.shape[0]-1):
            xs = np.sum(data[indptr[i]:indptr[i+1]])
            #Modify the view in place
            data[indptr[i]:indptr[i+1]]/=xs
    
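As a quick sanity check (my own addition, not from the original answer): after running Numpy_csc_norm, every column that has at least one nonzero entry should sum to 1.

```python
import numpy as np
from scipy import sparse

def Numpy_csc_norm(data, indptr):
    # normalize each CSC column in place by its sum
    for i in range(indptr.shape[0] - 1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        data[indptr[i]:indptr[i+1]] /= xs

# same shape as in the question
X = sparse.random(908, 2500, density=0.05, format='csc', random_state=42)
Numpy_csc_norm(X.data, X.indptr)

col_sums = np.asarray(X.sum(axis=0)).ravel()
# columns with no nonzeros stay at 0; all others sum to 1
assert np.allclose(col_sums[col_sums != 0], 1.0)
```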

    Performance-wise this in-place solution is already not too bad. If you want to improve it further, you could use Cython, Numba, or some other compiled code that can be wrapped up in Python more or less easily.

    A Numba solution

    import numba as nb
    
    @nb.njit(fastmath=True,error_model="numpy",parallel=True)
    def Numba_csc_norm(data,indptr):
        for i in nb.prange(indptr.shape[0]-1):
            acc=0
            for j in range(indptr[i],indptr[i+1]):
                acc+=data[j]
            for j in range(indptr[i],indptr[i+1]):
                data[j]/=acc
    

    Performance

    #Create a not too small example matrix
    shape=(50_000,10_000)
    X=sparse.random(shape[0], shape[1], density=0.001, format='csc')
    
    #Not in-place from hpaulj
    def hpaulj(X):
        acc=X.sum(axis=0)
        return X.multiply(sparse.csr_matrix(1./acc))
    
    %timeit X2=hpaulj(X)
    #6.54 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    #Both remaining variants work in-place,
    #but this shouldn't have an influence on the timings
    
    %timeit Numpy_csc_norm(X.data,X.indptr)
    #79.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    #compiled with parallel=False -> faster on tiny matrices
    %timeit Numba_csc_norm(X.data,X.indptr)
    #626 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    #compiled with parallel=True -> faster on larger matrices
    %timeit Numba_csc_norm(X.data,X.indptr)
    #185 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
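To confirm that the in-place and the not-in-place variants produce the same result, here is a hedged correctness check of my own (smaller matrix than the benchmark, seeded for reproducibility):

```python
import numpy as np
from scipy import sparse

def Numpy_csc_norm(data, indptr):
    # in-place column normalization on the raw CSC arrays
    for i in range(indptr.shape[0] - 1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        data[indptr[i]:indptr[i+1]] /= xs

def hpaulj(X):
    # not in-place: returns a new normalized matrix
    acc = X.sum(axis=0)
    return X.multiply(sparse.csr_matrix(1.0 / acc))

X = sparse.random(500, 200, density=0.1, format='csc', random_state=1)
X2 = hpaulj(X)                    # new matrix, X untouched
Numpy_csc_norm(X.data, X.indptr)  # modifies X itself
assert np.allclose(X2.toarray(), X.toarray())
```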