Tags: numpy, sparse-matrix

scipy super sparse matrix multiplication is super slow


There are some posts on SO discussing sparse matrix multiplication performance, but they don't seem to answer my question here.

Here is the benchmark code:

# First, construct a sparse matrix
In [1]: import numpy as np; from scipy.sparse import dok_matrix
In [2]: M = dok_matrix((100, (1<<32)-1), dtype=np.float32)
In [3]: rows, cols = M.shape
In [5]: js = np.random.randint(0, (1<<32)-1, size=100)
In [6]: for i in range(rows):
   ...:     for j in js:
   ...:         M[i,j] = 1.0
   ...:

# Check out sparsity
In [7]: M.shape
Out[7]: (100, 4294967295)
In [8]: M.count_nonzero()
Out[8]: 10000

# Test csr dot performance
In [9]: csr = M.tocsr()
In [10]: %time csr.dot(csr.T)
CPU times: user 36.3 s, sys: 1min 1s, total: 1min 37s
Wall time: 1min 46s
Out[10]:
<100x100 sparse matrix of type '<class 'numpy.float32'>'
    with 10000 stored elements in Compressed Sparse Row format>

The above csr.dot takes 36.3 s of user CPU time (1 min 46 s wall), which seems far too long. To speed it up, I wrote a naive for-loop dot function that works directly on the LIL data structures:

import numpy as np
from bisect import bisect_left

def bs(sorted_cols, col):
    """Minimal stand-in for the binary-search helper the original code
    assumes: return the index of col in the sorted list, or -1 if absent."""
    pos = bisect_left(sorted_cols, col)
    if pos < len(sorted_cols) and sorted_cols[pos] == col:
        return pos
    return -1

def lil_matmul_transposeB(A, B):
    """Compute A @ B.T for two LIL matrices by walking their row lists."""
    rows_a, cols_a = A.shape
    rows_b, cols_b = B.shape
    assert cols_a == cols_b
    C = np.zeros((rows_a, rows_b))

    for ra in range(rows_a):
        cols_ra = A.rows[ra]      # sorted column indices of row ra in A
        data_ra = A.data[ra]
        for i, ca in enumerate(cols_ra):
            xa = data_ra[i]
            for rb in range(rows_b):
                cols_rb = B.rows[rb]
                data_rb = B.data[rb]
                pos = bs(cols_rb, ca)   # does row rb of B contain column ca?
                if pos != -1:
                    C[ra, rb] += data_rb[pos] * xa
    return C

# Test dot performance in LiL format,
In [25]: lil = M.tolil()
In [26]: %time A = F.lil_matmul_transposeB(lil, lil)
CPU times: user 1.26 s, sys: 2.07 ms, total: 1.26 s
Wall time: 1.26 s

The above function takes only 1.26 s, much faster than the built-in csr.dot.
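
For completeness, a quick sanity check on a small toy matrix (just a sketch, separate from the timing runs above) confirms that both paths compute the same A @ A.T product:

# Sanity check on a small illustrative matrix:
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((5, 1000), dtype=np.float32)
for i, j in [(0, 3), (0, 10), (2, 3), (4, 999)]:
    S[i, j] = 1.0

C_loop = lil_matmul_transposeB(S.tolil(), S.tolil())   # the function defined above
C_csr = (S.tocsr() @ S.tocsr().T).toarray()
assert np.allclose(C_loop, C_csr)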

So, am I doing something wrong with the sparse matrix multiplication here?


Solution

  • That very large 2nd dimension is causing problems, even though the number of nonzero values is quite small.

    In [12]: Mr = M.tocsr()
    In [20]: Mr
    Out[20]: 
    <100x4294967295 sparse matrix of type '<class 'numpy.float32'>'
        with 10000 stored elements in Compressed Sparse Row format>
    

    Transposing just turns the CSR matrix into CSC without changing the underlying arrays. The indptr for both is only (101,) long.

    In [21]: Mr.T
    Out[21]: 
    <4294967295x100 sparse matrix of type '<class 'numpy.float32'>'
        with 10000 stored elements in Compressed Sparse Column format>
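
    A quick check makes the point (a small sketch, not from the original session): both the CSR matrix and its CSC transpose carry an indptr of only 101 entries, because 100 rows (or, after transposing, 100 columns) need just 101 offsets.

    # Illustrative check of the tiny indptr on both sides of the transpose:
    Mt = Mr.T                  # CSC transpose built from the same arrays
    print(Mr.indptr.shape)     # (101,) -- 100 rows + 1
    print(Mt.indptr.shape)     # (101,) -- 100 columns + 1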
    

    But when I do Mr@Mr.T, I get an error when it tries to convert that Mr.T to CSR. That is, the multiplication requires both operands in the same format:

    In [22]: Mr.T.tocsr()
    Traceback (most recent call last):
      File "<ipython-input-22-a376906f557e>", line 1, in <module>
        Mr.T.tocsr()
      File "/usr/local/lib/python3.8/dist-packages/scipy/sparse/csc.py", line 138, in tocsr
        indptr = np.empty(M + 1, dtype=idx_dtype)
    MemoryError: Unable to allocate 32.0 GiB for an array with shape (4294967296,) and data type int64
    

    It's trying to make a matrix with an indptr that's (4294967296,) long. On my limited-RAM machine that produces an error. On yours it must be hitting some sort of memory management/swap behaviour that's slowing it way down.
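
    That 32.0 GiB is exactly the size of the new indptr it tries to allocate:

    # 2**32 int64 entries at 8 bytes each, expressed in GiB:
    print((2**32 * 8) / 2**30)   # -> 32.0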

    So it's the extreme dimension that's making this slow even though the nnz is small.
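
    If all that's needed is the A @ A.T product, one workaround (just a sketch, using a helper of my own naming) is to relabel the occupied columns to a compact 0..k-1 range before multiplying; the Gram product only depends on which columns two rows share, so the result is unchanged and the enormous dimension disappears.

    import numpy as np
    from scipy.sparse import csr_matrix

    def compress_columns(A):
        """Relabel the columns of CSR matrix A to 0..k-1, keeping only the
        k columns that actually hold data; A @ A.T is unchanged by this."""
        A = A.tocsr()
        used = np.unique(A.indices)                 # sorted occupied columns
        new_idx = np.searchsorted(used, A.indices)  # compact relabeling
        return csr_matrix((A.data, new_idx, A.indptr),
                          shape=(A.shape[0], len(used)))

    # Ac = compress_columns(M.tocsr())
    # Ac.dot(Ac.T)   # same 100x100 result, without the 2**32-wide dimension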