I have a scenario where I need to multiply a small vector a with a huge, highly sparse matrix b. Here's a simplified version of the code:
import numpy as np
B = 32
M = 10000000
a = np.random.rand(B)
b = np.random.rand(B, M)
b = b > 0.9
result = a @ b
In my actual use case, the b matrix is loaded from an np.memmap file due to its large size. Importantly, b remains unchanged throughout the process, and I will be computing the inner product with a different vector a each time, so any preprocessing of b that leverages its sparse nature is allowed.
I'm seeking suggestions on how to optimize this matrix multiplication speed. Any insights or code examples would be greatly appreciated.
Testing with a smaller M
- I don't want to hang my machine with memory errors or long calcs:
In [340]: B = 32
...: M = 10000
...:
...: a = np.random.rand(B)
...: b = np.random.rand(B, M)
...: b = b > 0.9
...:
...:
In [341]: timeit a@b
1.23 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [342]: timeit np.einsum('i,ij',a,b)
590 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [343]: timeit np.einsum('i,ij',a,b, optimize=True)
1.57 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
I'm a little surprised that einsum is so much faster. Often einsum ends up using the same BLAS functions as matmul, especially if optimize is turned on. The big difference between the B and M dimensions probably has something to do with it; BLAS-type code was probably written with 'squarish' matrices in mind.
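One more factor worth checking (an assumption on my part, not something I've profiled in depth): b here is boolean, so a@b has to upcast it before multiplying, which rules out a direct BLAS call. Pre-casting it to float once at least makes the plain BLAS path possible:

```python
import numpy as np

rng = np.random.default_rng(0)
B, M = 32, 10_000
a = rng.random(B)
b = rng.random((B, M)) > 0.9

# Cast once up front; a @ bf is then an ordinary float matmul.
bf = b.astype(np.float64)

# The result is identical, since a @ b does the same cast internally.
assert np.allclose(a @ b, a @ bf)
```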
Trying scipy.sparse:
In [344]: from scipy import sparse
In [345]: bM = sparse.csr_matrix(b)
In [346]: timeit bM = sparse.csr_matrix(b)
3.19 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [347]: bM
Out[347]:
<32x10000 sparse matrix of type '<class 'numpy.bool_'>'
with 32241 stored elements in Compressed Sparse Row format>
In [348]: bM.indptr
Out[348]:
array([ 0, 1017, 2008, 3028, 4049, 5041, 5971, 6955, 7930,
8957, 9986, 10978, 12031, 13050, 14067, 15072, 16142, 17140,
18093, 19106, 20074, 21096, 22122, 23150, 24152, 25170, 26197,
27232, 28271, 29233, 30246, 31226, 32241], dtype=int32)
Creating the sparse matrix takes time. In effect it has to do a np.nonzero(b) on b to find the nonzero elements.
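A sketch of what that construction amounts to (my paraphrase, not the actual scipy source): locate the nonzero coordinates, then build the matrix from them.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
b = rng.random((32, 10_000)) > 0.9

# The conversion boils down to finding the nonzero coordinates...
rows, cols = np.nonzero(b)
data = np.ones(rows.size, dtype=bool)

# ...and assembling a CSR matrix from (data, (row, col)) triples.
bM2 = sparse.csr_matrix((data, (rows, cols)), shape=b.shape)

assert np.array_equal(bM2.toarray(), b)
```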
Having made it, the multiplication proceeds as before, delegating to bM's own version of matrix multiplication (don't use np.matmul, np.dot, or np.einsum with bM):
In [349]: res = a@bM
In [350]: type(res)
Out[350]: numpy.ndarray
In [351]: np.allclose(res, a@b)
Out[351]: True
In [352]: timeit a@bM
268 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
and the time is much better.
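Since b is boolean, another preprocessing worth trying (a sketch, untimed at your real sizes): store each row's True column indices once, then each new a costs only B scatter-adds.

```python
import numpy as np

rng = np.random.default_rng(0)
B, M = 32, 10_000
a = rng.random(B)
b = rng.random((B, M)) > 0.9

# One-time preprocessing: the True columns of each row.
cols = [np.flatnonzero(row) for row in b]

# Per-query work: for boolean b, (a @ b)[j] is the sum of a[i]
# over the rows i where b[i, j] is True.
result = np.zeros(M)
for i in range(B):
    result[cols[i]] += a[i]

assert np.allclose(result, a @ b)
```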
I'm not sure how the csr_matrix() step would work with a memmap array. But once created, that matrix can probably be kept in memory, saving the disk-access time.
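For the memmap case, one option (a sketch, assuming column slices of the memmap are cheap enough to read) is to convert the matrix in column chunks, so only one chunk is ever dense in RAM, and then glue the pieces together:

```python
import numpy as np
from scipy import sparse

B, M, chunk = 32, 100_000, 10_000   # stand-ins for the real sizes

# Stand-in for the on-disk array; in practice b = np.memmap(...)
rng = np.random.default_rng(0)
b = rng.random((B, M)) > 0.9

# Convert one column block at a time, then hstack the sparse blocks.
parts = [sparse.csr_matrix(b[:, j:j + chunk]) for j in range(0, M, chunk)]
bM = sparse.hstack(parts, format='csr')

a = rng.random(B)
assert np.allclose(a @ bM, a @ b)
```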