Tags: python, scipy, runtime, sparse-matrix, memory-efficient

Scipy Sparse Matrix Loop Taking Forever - Need to make more efficient


What is the most efficient way, in both time and memory, to write this loop over a sparse matrix (currently using csc_matrix)?

for j in range(0, reducedsize):
    xs = sum(X[:, j])
    X[:, j] = X[:, j] / xs.data[0]

example:

reducedsize (int) - 2500
X (csc_matrix) - 908x2500

The loop does iterate but it takes a very long time compared to just using numpy.


Solution

  • If you want to do it without any memory overhead (in-place):

    Always think about how the data is actually stored. A small example on a csc matrix:

    import numpy as np
    from scipy import sparse
    
    shape=(5,5)
    X=sparse.random(shape[0], shape[1], density=0.5, format='csc')
    print(X.todense())
    
    [[0.12146814 0.         0.         0.04075121 0.28749552]
     [0.         0.92208639 0.         0.44279661 0.        ]
     [0.63509196 0.42334964 0.         0.         0.99160443]
     [0.         0.         0.25941113 0.44669367 0.00389409]
     [0.         0.         0.         0.         0.83226886]]
    
    i=0 #first column
    print(X.data[X.indptr[i]:X.indptr[i+1]])
    [0.12146814 0.63509196]
    

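For completeness, the row positions of those values live in X.indices, sliced the same way. A small sketch of my own (not part of the original answer) that reconstructs a dense column from the raw CSC arrays; a seed is used so the run is reproducible:

```python
import numpy as np
from scipy import sparse

X = sparse.random(5, 5, density=0.5, format='csc', random_state=0)

i = 0  # first column
vals = X.data[X.indptr[i]:X.indptr[i+1]]     # nonzero values of column i
rows = X.indices[X.indptr[i]:X.indptr[i+1]]  # their row indices

# rebuild the dense column from the (row, value) pairs
dense_col = np.zeros(X.shape[0])
dense_col[rows] = vals
assert np.allclose(dense_col, X.toarray()[:, i])
```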
    A Numpy solution

    So the only thing we want to do here is to modify the nonzero entries, column by column, in place. This can be done with a partly vectorized numpy solution: data is simply the array that contains all nonzero values, and indptr stores where each column begins and ends.

    def Numpy_csc_norm(data,indptr):
        for i in range(indptr.shape[0]-1):
            xs = np.sum(data[indptr[i]:indptr[i+1]])
            #Modify the view in place
            data[indptr[i]:indptr[i+1]]/=xs
    
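As a quick sanity check (my own addition, not from the original answer): after running Numpy_csc_norm, every column that has at least one nonzero entry should sum to 1.

```python
import numpy as np
from scipy import sparse

def Numpy_csc_norm(data, indptr):
    # normalize each CSC column in place by its sum
    for i in range(indptr.shape[0] - 1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        data[indptr[i]:indptr[i+1]] /= xs

# same shape as in the question
X = sparse.random(908, 2500, density=0.05, format='csc', random_state=42)
Numpy_csc_norm(X.data, X.indptr)

col_sums = np.asarray(X.sum(axis=0)).ravel()
# columns with no nonzeros stay at 0; all others sum to 1
assert np.allclose(col_sums[col_sums != 0], 1.0)
```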

    Performance-wise this in-place solution is already not too bad. If you want to improve it further, you could use Cython, Numba, or some other compiled code that can be wrapped up in Python more or less easily.

    A Numba solution

    import numba as nb
    
    @nb.njit(fastmath=True,error_model="numpy",parallel=True)
    def Numba_csc_norm(data,indptr):
        for i in nb.prange(indptr.shape[0]-1):
            acc=0
            for j in range(indptr[i],indptr[i+1]):
                acc+=data[j]
            for j in range(indptr[i],indptr[i+1]):
                data[j]/=acc
    

    Performance

    #Create a not too small example matrix
    shape=(50_000,10_000)
    X=sparse.random(shape[0], shape[1], density=0.001, format='csc')
    
    #Not in-place from hpaulj
    def hpaulj(X):
        acc=X.sum(axis=0)
        return X.multiply(sparse.csr_matrix(1./acc))
    
    %timeit X2=hpaulj(X)
    #6.54 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    #Both remaining variants work in-place,
    #but this shouldn't have an influence on the timings
    
    %timeit Numpy_csc_norm(X.data,X.indptr)
    #79.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    #compiled with parallel=False -> faster on tiny matrices
    %timeit Numba_csc_norm(X.data,X.indptr)
    #626 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    #compiled with parallel=True -> faster on larger matrices
    %timeit Numba_csc_norm(X.data,X.indptr)
    #185 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
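To confirm that the in-place and the not-in-place variants produce the same result, here is a hedged correctness check of my own (smaller matrix than the benchmark, seeded for reproducibility):

```python
import numpy as np
from scipy import sparse

def Numpy_csc_norm(data, indptr):
    # in-place column normalization on the raw CSC arrays
    for i in range(indptr.shape[0] - 1):
        xs = np.sum(data[indptr[i]:indptr[i+1]])
        data[indptr[i]:indptr[i+1]] /= xs

def hpaulj(X):
    # not in-place: returns a new normalized matrix
    acc = X.sum(axis=0)
    return X.multiply(sparse.csr_matrix(1.0 / acc))

X = sparse.random(500, 200, density=0.1, format='csc', random_state=1)
X2 = hpaulj(X)                    # new matrix, X untouched
Numpy_csc_norm(X.data, X.indptr)  # modifies X itself
assert np.allclose(X2.toarray(), X.toarray())
```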