
Large memory usage of scipy.sparse.csr_matrix.toarray()


I have a fairly large sparse matrix A as a scipy.sparse.csr_matrix. It has the following properties:

A.shape: (77169, 77169)
A.nnz: 284811011
A.dtype: dtype('float16')

Now I have to convert it to a dense array using .toarray(). My estimate for the memory usage would be

77169**2 * (16./8.) / 1024.**3 = 11.09... GB

which would be fine as my machine has ~48GB of memory. In fact, if I create a=np.ones((77169, 77169), dtype=np.float16) that works fine and indeed a.nbytes/1024.**3 = 11.09.... However, when I run A.toarray() on the sparse matrix it fills all of memory and starts to use swap at some point (it doesn't raise a MemoryError). What's going wrong here? Shouldn't it easily fit into my memory?
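For reference, the arithmetic above can be repeated for a few dtypes. This is only a sketch of the size calculation: `dense_gib` is a hypothetical helper, the CSR figure assumes 32-bit index arrays, and none of it shows what toarray() actually allocates internally.

```python
import numpy as np

n = 77169          # matrix is n x n (from A.shape)
nnz = 284811011    # stored entries (from A.nnz)

def dense_gib(n, dtype):
    """Size in GiB of a dense n x n array of the given dtype (hypothetical helper)."""
    return n * n * np.dtype(dtype).itemsize / 1024.0**3

# Dense footprint for the dtype in question and for wider float types.
for dt in (np.float16, np.float32, np.float64):
    print(np.dtype(dt).name, round(dense_gib(n, dt), 2), "GiB")

# Rough CSR footprint: data + indices + indptr.
# Assumption: indices and indptr are int32 (4 bytes each).
csr_bytes = nnz * (np.dtype(np.float16).itemsize + 4) + (n + 1) * 4
print("csr", round(csr_bytes / 1024.0**3, 2), "GiB")
```

Any intermediate copy in a wider dtype would scale the dense figure by the ratio of item sizes, so it is worth comparing the observed peak usage against these numbers.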


Solution

  • For csr, toarray() does

    self.tocoo(copy=False).toarray(order=order, out=out)
    

    You could go on to trace coo.toarray, but I suspect it ends up using compiled code that does the equivalent of:

    In [715]: M=sparse.random(10,10,.2,format='csr')
    In [717]: M=M.astype(np.float16)
    In [718]: A = np.zeros(M.shape, M.dtype)
    In [719]: Mo=M.tocoo()
    In [720]: A[Mo.row, Mo.col] = Mo.data
    

    Curiously though if I do

    In [728]: Mo.toarray()
         ...
        257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
    --> 258                     B.ravel('A'), fortran)
        259         return B
    ...
    ValueError: Output dtype not compatible with inputs.
    

    It's having trouble with the float16. Mo.astype(float).toarray() works fine. I get this error even if I use toarray(out=out) with a float16 out, which makes me suspect coo_todense has been compiled for only a few dtype alternatives. Maybe I'll dig into that later.

    In [741]: scipy.__version__
    Out[741]: '0.18.1'
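Since upcasting before toarray() works, one possible workaround is to densify a block of rows at a time into a preallocated float16 array, so that only one block ever exists in the wider dtype. This is a sketch under assumptions, not something tested on the matrix in the question: `csr_to_dense_in_blocks`, its parameters, and the choice of float32 as the working dtype are all hypothetical. The demo uses a small float32 matrix, because some SciPy versions do not accept float16 sparse data.

```python
import numpy as np
from scipy import sparse

def csr_to_dense_in_blocks(A, out_dtype=np.float16, work_dtype=np.float32,
                           block_rows=1024):
    """Hypothetical helper: densify a CSR matrix a few rows at a time.

    Each row block is upcast to work_dtype (assumed to be a dtype that
    toarray() supports), converted to dense, and cast down on assignment.
    """
    out = np.zeros(A.shape, dtype=out_dtype)
    for start in range(0, A.shape[0], block_rows):
        stop = min(start + block_rows, A.shape[0])
        block = A[start:stop].astype(work_dtype)   # small sparse copy
        out[start:stop] = block.toarray()          # cast to out_dtype here
    return out

# Small demo matrix; sizes and density are arbitrary.
M = sparse.random(50, 40, density=0.2, format='csr',
                  dtype=np.float32, random_state=0)
dense = csr_to_dense_in_blocks(M, block_rows=7)
```

The peak extra memory is then roughly one dense block in work_dtype plus its sparse copy, rather than a full-size intermediate.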
    

    A comment in Warren's bug report

    but the xxx_todense functions are actually A += X,

    suggests that the copy from Mo.data to A[] is more complicated than what I indicated. toarray sums duplicates, as it would with Mo.tocsr() or Mo.sum_duplicates().
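The difference between plain index assignment and the summing behaviour can be seen on a tiny coo matrix with a duplicated entry. This is an illustrative sketch with made-up data; np.add.at is used here only as a NumPy stand-in for "A += X" and is not claimed to be what the compiled code does.

```python
import numpy as np
from scipy import sparse

# Entry (0, 1) appears twice; coo format keeps both until conversion.
row = np.array([0, 0, 1])
col = np.array([1, 1, 2])
data = np.array([1.0, 2.0, 5.0])
Mo = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

# Plain fancy-index assignment: one of the duplicate values overwrites the other.
assigned = np.zeros(Mo.shape, dtype=Mo.dtype)
assigned[Mo.row, Mo.col] = Mo.data

# Unbuffered in-place add: duplicates accumulate.
accumulated = np.zeros(Mo.shape, dtype=Mo.dtype)
np.add.at(accumulated, (Mo.row, Mo.col), Mo.data)

dense = Mo.toarray()
print(assigned[0, 1], accumulated[0, 1], dense[0, 1])
```

Here toarray() agrees with the accumulating version, not with the simple assignment.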