I have a fairly large sparse matrix A stored as a scipy.sparse.csr_matrix. It has the following properties:
A.shape: (77169, 77169)
A.nnz: 284811011
A.dtype: dtype('float16')
Now I have to convert it to a dense array using .toarray(). My estimate for the memory usage would be
77169**2 * (16./8.) / 1024.**3 = 11.09... GB
which would be fine as my machine has ~48GB of memory. In fact, if I create a=np.ones((77169, 77169), dtype=np.float16) that works fine, and indeed a.nbytes/1024.**3 = 11.09... However, when I run A.toarray() on the sparse matrix it fills all of memory and at some point starts to use swap (it doesn't raise a MemoryError). What's going wrong here? Shouldn't it easily fit into my memory?
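The arithmetic behind that estimate can be checked with a short sketch. The shape, nnz and float16 dtype come from the question; the 4-byte index size is an assumption (SciPy commonly uses int32 indices when they fit), and the float32 and float64 figures are plain arithmetic for comparison, not a claim about what toarray() allocates internally.

```python
import numpy as np

n = 77169          # A.shape[0] == A.shape[1], from the question
nnz = 284811011    # A.nnz, from the question
GIB = 1024.0 ** 3

# Size of a dense (n, n) array for a few dtypes.
dense_gib = {
    name: n * n * np.dtype(name).itemsize / GIB
    for name in ("float16", "float32", "float64")
}

# Rough CSR footprint: data + column indices + row pointers.
index_bytes = 4    # assumption: int32 indices
value_bytes = np.dtype("float16").itemsize
csr_gib = (nnz * value_bytes + nnz * index_bytes + (n + 1) * index_bytes) / GIB

for name, size in dense_gib.items():
    print(f"dense {name}: {size:.2f} GiB")
print(f"csr float16: {csr_gib:.2f} GiB")
```

The float16 figure reproduces the 11.09 from the question. A dense array of the same shape in float32 or float64 is two or four times as large, so any upcast of the full matrix changes the picture considerably.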
For csr, toarray() does
self.tocoo(copy=False).toarray(order=order, out=out)
You could go on to trace coo.toarray, which I suspect ends up using compiled code. I expect it ends up doing the equivalent of:
In [715]: M=sparse.random(10,10,.2,format='csr')
In [717]: M=M.astype(np.float16)
In [718]: A = np.zeros(M.shape, M.dtype)
In [719]: Mo=M.tocoo()
In [720]: A[Mo.row, Mo.col] = Mo.data
Curiously though if I do
In [728]: Mo.toarray()
...
257 coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258 B.ravel('A'), fortran)
259 return B
...
ValueError: Output dtype not compatible with inputs.
It's having trouble with the float16 dtype. Mo.astype(float).toarray() works fine. I get this error even if I use toarray(out=out) with a float16 out, which makes me suspect coo_todense has been compiled with just a couple of dtype alternatives. Maybe I'll dig into that later.
In [741]: scipy.__version__
Out[741]: '0.18.1'
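Upcasting the whole matrix with astype(float) produces a float64 result, which for the matrix in the question would be four times the float16 estimate. One way around that, sketched here under assumptions, is to densify a few rows at a time into a preallocated float16 array. The helper name toarray_in_blocks and the block_rows parameter are hypothetical, not part of SciPy, and the sketch assumes the sparse data is held in a dtype the compiled routines accept (float32 here).

```python
import numpy as np
from scipy import sparse


def toarray_in_blocks(mat, out_dtype=np.float16, block_rows=1024):
    """Densify a sparse matrix a few rows at a time into a preallocated array.

    Hypothetical helper for illustration; not a SciPy function.
    """
    mat = sparse.csr_matrix(mat)  # CSR makes row slicing cheap
    out = np.zeros(mat.shape, dtype=out_dtype)
    for start in range(0, mat.shape[0], block_rows):
        stop = min(start + block_rows, mat.shape[0])
        # Only this row slice is densified in the sparse dtype;
        # the assignment casts it to out_dtype.
        out[start:stop] = mat[start:stop].toarray()
    return out


# Small demonstration with float32 sparse data.
rng = np.random.default_rng(0)
dense = rng.random((50, 40)).astype(np.float32)
dense[dense < 0.8] = 0.0
M = sparse.csr_matrix(dense)

result = toarray_in_blocks(M, out_dtype=np.float16, block_rows=7)
```

The temporary dense slice is only block_rows rows tall, so peak memory stays close to the size of the final array plus the sparse matrix itself.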
A comment in Warren's bug report,
but the xxx_todense functions are actually A += X,
suggests that the copy from Mo.data to A[] is more complicated than what I indicated. toarray sums duplicates, as happens with Mo.tocsr() or Mo.sum_duplicates().
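The difference between plain index assignment and the summing behaviour can be seen on a tiny matrix with a duplicated coordinate. This is a minimal sketch in float64; the use of np.add.at to imitate the "A += X" accumulation is my own stand-in, not what the compiled code is known to do.

```python
import numpy as np
from scipy import sparse

# Two entries share position (0, 1).
row = np.array([0, 0, 1])
col = np.array([1, 1, 2])
data = np.array([1.0, 2.0, 5.0])
Mo = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

# toarray() adds the duplicate entries together.
summed = Mo.toarray()

# Plain fancy-index assignment keeps only one of the duplicate values.
assigned = np.zeros(Mo.shape, dtype=Mo.dtype)
assigned[Mo.row, Mo.col] = Mo.data

# Unbuffered accumulation, which behaves like A += X for repeated indices.
accumulated = np.zeros(Mo.shape, dtype=Mo.dtype)
np.add.at(accumulated, (Mo.row, Mo.col), Mo.data)

print(summed[0, 1], assigned[0, 1], accumulated[0, 1])
```

Here summed and accumulated agree, while assigned holds just one of the two values stored at (0, 1), so the A[Mo.row, Mo.col] = Mo.data shortcut only matches toarray() when the COO matrix has no duplicate entries.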