python, matrix, scipy, sparse-matrix, large-data

Error creating very large sparse matrix from sub-blocks using scipy.sparse.bmat


I am trying to create a matrix with scipy.sparse.bmat from smaller CSR matrices; my function call is sparse.bmat(HList, format='csr'). The resulting matrix is square, with ~2.6 billion rows/columns. However, I get the following error when I attempt to construct it:

    Traceback (most recent call last):
      [...]/lib/python3.7/site-packages/scipy/sparse/construct.py", line 623, in bmat
        return coo_matrix((data, (row, col)), shape=shape).asformat(format)
      [...]/lib/python3.7/site-packages/scipy/sparse/coo.py", line 192, in __init__
        self._check()
      [...]/lib/python3.7/site-packages/scipy/sparse/coo.py", line 283, in _check
        raise ValueError('negative row index found')
    ValueError: negative row index found 

The problem appears to occur when the combined matrix is converted into COO format. I believe the problem has something to do with index overflow, as the indices of the full matrix won't fit in a 32-bit integer (2.6 billion > 2^31). I have tested my matrix construction script on a smaller version of this problem, and it worked correctly.
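
To illustrate the kind of wrap-around I suspect (a toy example, not my actual data): casting a value larger than 2^31 - 1 down to int32 silently wraps to a negative number, which would explain the "negative row index" message:

>>> import numpy as np
>>> np.array([2_600_000_000]).astype(np.int32)
array([-1694967296], dtype=int32)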

This post describes a problem very similar to mine; however, the solutions listed there didn't work for my situation. Running the test described there,

>>> scipy.sparse.sputils.get_index_dtype((np.array(10**10),))
<class 'numpy.int64'>

I can confirm that numpy is using 64-bit indices.
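
In case it's relevant, the index dtype of an individual block can also be checked directly (assuming HList is indexed as a plain nested list of CSR blocks):

    # dtype of the index arrays of one CSR block
    print(HList[0][0].indices.dtype, HList[0][0].indptr.dtype)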

Is there some other part of my program causing overflow? Or is the source of the problem something else entirely?

Any help is greatly appreciated!


Solution

    import numpy as np
    from scipy.sparse import coo_matrix, bmat

    # One nonzero entry at (1e9, 1e9), so each block is roughly 1e9 x 1e9.
    a = coo_matrix(([1], ([int(1e9)], [int(1e9)])))

    # A 20 x 200 grid of these blocks: the combined matrix exceeds 2**31
    # in both dimensions.
    blocks = [a.copy() for i in range(200)]
    blocks = [blocks for i in range(20)]

    arr = bmat(blocks, format='coo')
    

    First things first: this is definitely reproducible (I'm using a COO matrix because I don't want to allocate a 1e11-entry indptr array).

    ValueError: negative row index found
    

    It also doesn't help to convert the indices of the a array from int32 to int64. In fact, it looks like the problem is entirely internal to the bmat function:

    # convert everything to COO format
    for i in range(M):
        for j in range(N):
            if blocks[i,j] is not None:
                A = coo_matrix(blocks[i,j])
    

    First, it converts all of your blocks to COO matrices. If a block's row and column indices fit in int32s, it will use int32s (and I assume yours do). Later on, it calculates new row and column values by adding an offset (based on where each block sits in the grid). Unfortunately, this is where it overflows:

    for i, j in zip(ii, jj):
        B = blocks[i, j]
        ...
        row[idx] = B.row + row_offsets[i]
        col[idx] = B.col + col_offsets[j]
    
    >>> blocks[2, 0].row
    array([1000000000], dtype=int32)
    
    >>> blocks[2, 0].row + 2000000002
    array([-1294967294], dtype=int32)
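
    For comparison, doing the same addition after casting the block's row indices to int64 stays in range, which is exactly what the fix below does:

    >>> blocks[2, 0].row.astype(np.int64) + 2000000002
    array([3000000002])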
    

    Because of that overflow (and because it happens in code inside bmat that you can't reach from the outside), this is a scipy bug. That said, you can fix it quite simply by copying the scipy bmat function and retyping the block index arrays as follows:

    for i, j in zip(ii, jj):
        B = blocks[i, j]
        ...
        row[idx] = B.row.astype(idx_dtype) + row_offsets[i]
        col[idx] = B.col.astype(idx_dtype) + col_offsets[j]
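
    If you'd rather not maintain a patched copy of bmat, here is a minimal sketch of the same idea applied outside scipy: assemble the combined COO arrays yourself, casting every block's indices to int64 before adding the offsets. This assumes a plain 2D list of blocks with no None entries and conforming shapes; bmat_int64 is just an illustrative name, not a scipy function.

    import numpy as np
    from scipy.sparse import coo_matrix

    def bmat_int64(blocks):
        # Convert every block to COO so .row/.col/.data are available.
        blocks = [[coo_matrix(b) for b in block_row] for block_row in blocks]

        # Offsets of each block row / block column in the combined matrix,
        # kept in int64 so they can exceed 2**31.
        row_heights = np.asarray([block_row[0].shape[0] for block_row in blocks],
                                 dtype=np.int64)
        col_widths = np.asarray([b.shape[1] for b in blocks[0]], dtype=np.int64)
        row_offsets = np.concatenate(([0], np.cumsum(row_heights)))
        col_offsets = np.concatenate(([0], np.cumsum(col_widths)))

        rows, cols, data = [], [], []
        for i, block_row in enumerate(blocks):
            for j, b in enumerate(block_row):
                # Cast to int64 *before* adding the offset, same as the patch above.
                rows.append(b.row.astype(np.int64) + row_offsets[i])
                cols.append(b.col.astype(np.int64) + col_offsets[j])
                data.append(b.data)

        shape = (int(row_offsets[-1]), int(col_offsets[-1]))
        return coo_matrix((np.concatenate(data),
                           (np.concatenate(rows), np.concatenate(cols))),
                          shape=shape)

    As with the patched bmat, the result is a COO matrix; converting it to CSR afterwards allocates an indptr array with one entry per row, which for ~2.6 billion rows is a sizeable amount of memory on its own.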