Tags: python, parallel-processing, anaconda, sparse-matrix, numba

How to parallelize this Python for loop when using Numba

I'm using the Anaconda distribution of Python, together with Numba, and I've written the following Python function that multiplies a sparse matrix `A` (stored in CSR format) by a dense vector `x`:

```python
import numpy
from numba import jit

@jit
def csrMult( x, Adata, Aindices, Aindptr, Ashape ):

    numRowsA = Ashape[0]
    Ax       = numpy.zeros( numRowsA )

    for i in range( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax
```

Here `A` is a large `scipy` sparse matrix,

```python
>>> A.shape
(56469, 39279)
#                  having ~ 142,258,302 nonzero entries (so about 6.4% dense)
>>> A.dtype
dtype('float32')
```
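For readers without a comparable matrix on hand, a stand-in with the same format and roughly the same density can be generated with `scipy.sparse.random` (a sketch; the shape below is scaled down so it builds quickly, whereas the real matrix is 56469 × 39279):

```python
import numpy
import scipy.sparse

# Hypothetical stand-in for A: CSR format, float32, ~6.4% dense,
# scaled down from the real 56469 x 39279 matrix so it builds quickly.
A_test = scipy.sparse.random(1000, 800, density=0.064, format='csr',
                             dtype=numpy.float32, random_state=0)
print(A_test.shape)  # (1000, 800)
print(A_test.nnz)    # ~51,200 nonzero entries (0.064 * 1000 * 800)
```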

and `x` is a `numpy` array. Here is a snippet of code that calls the above function:

```python
x       = numpy.random.randn( A.shape[1] )
Ax      = A.dot( x )
AxCheck = csrMult( x, A.data, A.indices, A.indptr, A.shape )
```

Notice the `@jit`-decorator that tells Numba to do a just-in-time compilation for the `csrMult()` function.

In my experiments, my function `csrMult()` is about twice as fast as the `scipy` `.dot()` method. That is a pretty impressive result for Numba.
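To sanity-check `csrMult()` against scipy before timing anything, the same loop can be run without the `@jit` decorator and compared to `A.dot(x)` (a sketch using a small hypothetical test matrix; numerical agreement, not speed, is the point here):

```python
import numpy
import scipy.sparse

def csrMult_plain(x, Adata, Aindices, Aindptr, Ashape):
    # Same loop structure as csrMult above, minus the @jit decorator,
    # so it runs even where Numba is not installed.
    numRowsA = Ashape[0]
    Ax = numpy.zeros(numRowsA)
    for i in range(numRowsA):
        Ax_i = 0.0
        for dataIdx in range(Aindptr[i], Aindptr[i + 1]):
            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]
        Ax[i] = Ax_i
    return Ax

# Hypothetical small test matrix (the real one is far larger).
A = scipy.sparse.random(500, 400, density=0.064, format='csr',
                        dtype=numpy.float32, random_state=0)
x = numpy.random.randn(A.shape[1])

# The two results should agree to floating-point tolerance.
assert numpy.allclose(csrMult_plain(x, A.data, A.indices, A.indptr, A.shape),
                      A.dot(x), atol=1e-5)
```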

However, MATLAB still performs this matrix-vector multiplication about 6 times faster than `csrMult()`. I believe that is because MATLAB uses multithreading when performing sparse matrix-vector multiplication.

Question:

How can I parallelize the outer `for`-loop when using Numba?

Numba used to have a `prange()` function that made it simple to parallelize embarrassingly parallel `for`-loops. Unfortunately, Numba no longer has `prange()` [actually, that is false; see the edit below]. So what is the correct way to parallelize this `for`-loop now that Numba's `prange()` function is gone?

When `prange()` was removed from Numba, what alternative did the developers of Numba have in mind?

Edit 1:
I updated to the latest version of Numba, which is 0.35, and `prange()` is back! It was not included in version 0.33, the version I had been using.
That is good news, but unfortunately I get an error message when I attempt to parallelize my for loop using `prange()`. I followed the parallel for loop example from the Numba documentation (see section 1.9.2, "Explicit Parallel Loops"); below is my new code:

```python
import numpy as np
from numba import njit, prange

@njit( parallel=True )
def csrMult_numba( x, Adata, Aindices, Aindptr, Ashape ):

    numRowsA = Ashape[0]
    Ax       = np.zeros( numRowsA )

    for i in prange( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax
```

When I call this function, using the code snippet given above, I receive the following error:

```
AttributeError: Failed at nopython (convert to parfors)
'SetItem' object has no attribute 'get_targets'
```

Given that the above attempt to use `prange` crashes, my question stands:

What is the correct way ( using `prange` or an alternative method ) to parallelize this Python `for`-loop?

As noted below, it was trivial to parallelize a similar for loop in C++ and obtain an 8x speedup, running on 20 OpenMP threads. There must be a way to do it using Numba, since the for loop is embarrassingly parallel (and since sparse matrix-vector multiplication is a fundamental operation in scientific computing).

Edit 2:
Here is my C++ version of `csrMult()`. Parallelizing the `for()` loop in the C++ version makes the code about 8x faster in my tests. This suggests to me that a similar speedup should be possible for the Python version when using Numba.

```cpp
void csrMult(VectorXd& Ax, VectorXd& x, vector<double>& Adata, vector<int>& Aindices, vector<int>& Aindptr)
{
    // This code assumes that the size of Ax is numRowsA.
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic, 590)
        for (int i = 0; i < Ax.size(); i++)
        {
            double Ax_i = 0.0;
            for (int dataIdx = Aindptr[i]; dataIdx < Aindptr[i + 1]; dataIdx++)
            {
                Ax_i += Adata[dataIdx] * x[Aindices[dataIdx]];
            }

            Ax[i] = Ax_i;
        }
    }
}
```

Solution

• Numba has been updated and `prange()` works now! (I'm answering my own question.)

The improvements to Numba's parallel computing capabilities are discussed in this blog post, dated December 12, 2017. Here is a relevant snippet from the blog:

> Long ago (more than 20 releases!), Numba used to have support for an idiom to write parallel for loops called `prange()`. After a major refactoring of the code base in 2014, this feature had to be removed, but it has been one of the most frequently requested Numba features since that time. After the Intel developers parallelized array expressions, they realized that bringing back `prange` would be fairly easy.

Using Numba version 0.36.1, I can parallelize my embarrassingly parallel `for`-loop using the following simple code:

```python
import numpy as np
import numba

@numba.jit( nopython=True, parallel=True )
def csrMult_parallel( x, Adata, Aindices, Aindptr, Ashape ):

    numRowsA = Ashape[0]
    Ax       = np.zeros( numRowsA )

    for i in numba.prange( numRowsA ):
        Ax_i = 0.0
        for dataIdx in range( Aindptr[i], Aindptr[i+1] ):

            j     = Aindices[dataIdx]
            Ax_i += Adata[dataIdx] * x[j]

        Ax[i] = Ax_i

    return Ax
```

In my experiments, parallelizing the `for`-loop made the function execute about eight times faster than the version I posted at the beginning of my question, which was already using Numba but was not parallelized. Moreover, in my experiments the parallelized version is about 5x faster than the command `Ax = A.dot(x)`, which uses scipy's sparse matrix-vector multiplication routine. Numba has crushed scipy, and I finally have a Python sparse matrix-vector multiplication routine that is as fast as MATLAB.