What is the most efficient way time & memory wise of writing this loop with sparse matrix (currently using csc_matrix)
for j in range(0, reducedsize):
xs = sum(X[:, j])
X[:, j] = X[:, j] / xs.data[0]
example:
reduced size (int) - 2500
X (csc_matrix) - 908x2500
The loop does iterate but it takes a very long time compared to just using numpy.
Always think on how the data is actually stored. A small example on a csc matrix.
shape=(5,5)
X=sparse.random(shape[0], shape[1], density=0.5, format='csc')
print(X.todense())
[[0.12146814 0. 0. 0.04075121 0.28749552]
[0. 0.92208639 0. 0.44279661 0. ]
[0.63509196 0.42334964 0. 0. 0.99160443]
[0. 0. 0.25941113 0.44669367 0.00389409]
[0. 0. 0. 0. 0.83226886]]
i=0 #first column
print(X.data[X.indptr[i]:X.indptr[i+1]])
[0.12146814 0.63509196]
A Numpy solution
So the only thing we want to do here is to modify the nonzero entries column by column in place. This can be easily done using a partly vectorized numpy solution. data
is just the array which contains all non zero values, indptr
stores the information where each column begins and ends.
def Numpy_csc_norm(data,indptr):
for i in range(indptr.shape[0]-1):
xs = np.sum(data[indptr[i]:indptr[i+1]])
#Modify the view in place
data[indptr[i]:indptr[i+1]]/=xs
Regarding performance this in-place solution is already not too bad. If you want to improve the performance further you could use Cython/Numba/ or some other compiled code which can be wrapped up in Python more or less easily.
A Numba solution
@nb.njit(fastmath=True,error_model="numpy",parallel=True)
def Numba_csc_norm(data,indptr):
for i in nb.prange(indptr.shape[0]-1):
acc=0
for j in range(indptr[i],indptr[i+1]):
acc+=data[j]
for j in range(indptr[i],indptr[i+1]):
data[j]/=acc
Performance
#Create a not to small example matrix
shape=(50_000,10_000)
X=sparse.random(shape[0], shape[1], density=0.001, format='csc')
#Not in-place from hpaulj
def hpaulj(X):
acc=X.sum(axis=0)
return X.multiply(sparse.csr_matrix(1./acc))
%timeit X2=hpaulj(X)
#6.54 ms ± 67.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#All 2 variants are in-place,
#but this shouldn't have a influence on the timings
%timeit Numpy_csc_norm(X.data,X.indptr)
#79.2 ms ± 914 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#parallel=False -> faster on tiny matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#626 µs ± 30.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
#parallel=True -> faster on larger matrices
%timeit Numba_csc_norm(X.data,X.indptr)
#185 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)