I have a scipy.sparse matrix with 45671x45671 elements. In this matrix, some rows contain only '0' value.
My question is, how to divide each row values by the row sum. Obviously, with for loop it's work, but I look for an efficient method...
I already tried :
matrix / matrix.sum(1)
but I have MemoryError
issue.matrix / scs.csc_matrix((matrix.sum(axis=1)))
but ValueError: inconsistent shapes
Moreover, I want to skip rows with only '0' values.
So, if you have any solution...
Thank you in advance !
I have an M
hanging around:
In [241]: M
Out[241]:
<6x3 sparse matrix of type '<class 'numpy.uint8'>'
with 6 stored elements in Compressed Sparse Row format>
In [242]: M.A
Out[242]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]], dtype=uint8)
In [243]: M.sum(1) # dense matrix
Out[243]:
matrix([[1],
[1],
[1],
[1],
[1],
[1]], dtype=uint32)
In [244]: M/M.sum(1) # dense matrix - full size of M
Out[244]:
matrix([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 1., 0., 0.]])
That will explain the memory error - if M
is so large that M.A
produces a memory error.
In [262]: S = sparse.csr_matrix(M.sum(1))
In [263]: S.shape
Out[263]: (6, 1)
In [264]: M.shape
Out[264]: (6, 3)
In [265]: M/S
....
ValueError: inconsistent shapes
I'm not entirely sure what is going on here.
Element wise multiplication works
In [266]: M.multiply(S)
Out[266]:
<6x3 sparse matrix of type '<class 'numpy.uint32'>'
with 6 stored elements in Compressed Sparse Row format>
So it should work if I construct S
as S = sparse.csr_matrix(1/M.sum(1))
If some of the rows sum to zero, you have a division by zero problem.
If I modify M
to have 0 row
In [283]: M.A
Out[283]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]], dtype=uint8)
In [284]: S = sparse.csr_matrix(1/M.sum(1))
/usr/local/bin/ipython3:1: RuntimeWarning: divide by zero encountered in true_divide
#!/usr/bin/python3
In [285]: S.A
Out[285]:
array([[ 1.],
[ 1.],
[ inf],
[ 1.],
[ 1.],
[ 1.]])
In [286]: M.multiply(S)
Out[286]:
<6x3 sparse matrix of type '<class 'numpy.float64'>'
with 5 stored elements in Compressed Sparse Row format>
In [287]: _.A
Out[287]:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.],
[ 1., 0., 0.]])
This isn't the best M
to demonstrate this on, but it suggests a useful approach. The row sum will be dense, so you can clean up its inverse using the usual dense array approaches.