python, matrix, scipy, sparse-matrix, large-data

Error creating very large sparse matrix from sub-blocks using scipy.sparse.bmat


I am trying to create a matrix with scipy.sparse.bmat from smaller CSR matrices; my function call is sparse.bmat(HList, format='csr'). The resulting matrix is square, with ~2.6 billion rows/columns. However, I get the following error when I attempt to construct it:

    Traceback (most recent call last):
      [...]/lib/python3.7/site-packages/scipy/sparse/construct.py", line 623, in bmat
        return coo_matrix((data, (row, col)), shape=shape).asformat(format)
      [...]/lib/python3.7/site-packages/scipy/sparse/coo.py", line 192, in __init__
        self._check()
      [...]/lib/python3.7/site-packages/scipy/sparse/coo.py", line 283, in _check
        raise ValueError('negative row index found')
    ValueError: negative row index found 

The problem appears to occur when the combined matrix is converted into COO format. I believe the problem has something to do with index overflow, as the indices of the full matrix won't fit in a 32-bit integer (2.6 billion > 2^31). I have tested my matrix construction script on a smaller version of this problem, and it worked correctly.
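
To illustrate the kind of wrap-around I suspect (a toy example, not my actual data): casting a value larger than 2^31 - 1 down to int32 silently wraps to a negative number, which would explain the "negative row index" message:

>>> import numpy as np
>>> np.array([2_600_000_000]).astype(np.int32)
array([-1694967296], dtype=int32)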

This post describes a problem very similar to mine; however, the solutions listed there didn't work for my situation. Running the test described there,

>>> scipy.sparse.sputils.get_index_dtype((np.array(10**10),))
<class 'numpy.int64'>

I can confirm that numpy is using 64-bit indices.
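
In case it's relevant, the index dtype of an individual block can also be checked directly (assuming HList is indexed as a plain nested list of CSR blocks):

    # dtype of the index arrays of one CSR block
    print(HList[0][0].indices.dtype, HList[0][0].indptr.dtype)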

Is there some other part of my program causing overflow? Or is the source of the problem something else entirely?

Any help is greatly appreciated!


Solution

    import numpy as np
    from scipy.sparse import coo_matrix, bmat

    # One nonzero entry at (1e9, 1e9), so each block is roughly 1e9 x 1e9.
    a = coo_matrix(([1], ([int(1e9)], [int(1e9)])))

    # A 20 x 200 grid of these blocks: the combined matrix exceeds 2**31
    # in both dimensions.
    blocks = [a.copy() for i in range(200)]
    blocks = [blocks for i in range(20)]

    arr = bmat(blocks, format='coo')
    

    First things first: this is definitely reproducible (I'm using a COO matrix because I don't want to allocate a 1e11-entry indptr array).

    ValueError: negative row index found
    

    It also doesn't help to convert the indices of the a array from int32 to int64. In fact, it looks like the problem is entirely internal to the bmat function:

    # convert everything to COO format
    for i in range(M):
        for j in range(N):
            if blocks[i,j] is not None:
                A = coo_matrix(blocks[i,j])
    

    First, it converts all of your blocks to COO matrices. If a block's row and column indices fit in int32s, it will use int32s (and I assume yours do). Later on, it calculates new row and column values by adding an offset (based on where each block sits in the grid). Unfortunately, this is where it overflows:

    for i, j in zip(ii, jj):
        B = blocks[i, j]
        ...
        row[idx] = B.row + row_offsets[i]
        col[idx] = B.col + col_offsets[j]
    
    >>> blocks[2, 0].row
    array([1000000000], dtype=int32)
    
    >>> blocks[2, 0].row + 2000000002
    array([-1294967294], dtype=int32)
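
    For comparison, doing the same addition after casting the block's row indices to int64 stays in range, which is exactly what the fix below does:

    >>> blocks[2, 0].row.astype(np.int64) + 2000000002
    array([3000000002])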
    

    Because of that overflow (and because it happens in code inside bmat that you can't reach from the outside), this is a scipy bug. That said, you can fix it quite simply by copying the scipy bmat function and retyping the block index arrays as follows:

    for i, j in zip(ii, jj):
        B = blocks[i, j]
        ...
        row[idx] = B.row.astype(idx_dtype) + row_offsets[i]
        col[idx] = B.col.astype(idx_dtype) + col_offsets[j]
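
    If you'd rather not maintain a patched copy of bmat, here is a minimal sketch of the same idea applied outside scipy: assemble the combined COO arrays yourself, casting every block's indices to int64 before adding the offsets. This assumes a plain 2D list of blocks with no None entries and conforming shapes; bmat_int64 is just an illustrative name, not a scipy function.

    import numpy as np
    from scipy.sparse import coo_matrix

    def bmat_int64(blocks):
        # Convert every block to COO so .row/.col/.data are available.
        blocks = [[coo_matrix(b) for b in block_row] for block_row in blocks]

        # Offsets of each block row / block column in the combined matrix,
        # kept in int64 so they can exceed 2**31.
        row_heights = np.asarray([block_row[0].shape[0] for block_row in blocks],
                                 dtype=np.int64)
        col_widths = np.asarray([b.shape[1] for b in blocks[0]], dtype=np.int64)
        row_offsets = np.concatenate(([0], np.cumsum(row_heights)))
        col_offsets = np.concatenate(([0], np.cumsum(col_widths)))

        rows, cols, data = [], [], []
        for i, block_row in enumerate(blocks):
            for j, b in enumerate(block_row):
                # Cast to int64 *before* adding the offset, same as the patch above.
                rows.append(b.row.astype(np.int64) + row_offsets[i])
                cols.append(b.col.astype(np.int64) + col_offsets[j])
                data.append(b.data)

        shape = (int(row_offsets[-1]), int(col_offsets[-1]))
        return coo_matrix((np.concatenate(data),
                           (np.concatenate(rows), np.concatenate(cols))),
                          shape=shape)

    As with the patched bmat, the result is a COO matrix; converting it to CSR afterwards allocates an indptr array with one entry per row, which for ~2.6 billion rows is a sizeable amount of memory on its own.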