Tags: python, performance, optimization, numba

How to parallelize a simple loop over a matrix?


I have a large matrix and I want to output all the indices where the elements in the matrix are less than 0. I have this MWE in numba:


import numba as nb
import numpy as np
A = np.random.random(size = (1000, 1000)) - 0.1
@nb.njit(cache=True)
def numba_only(arr):
    rows = np.empty(arr.shape[0]*arr.shape[1])
    cols = np.empty(arr.shape[0]*arr.shape[1])
    idx = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            if arr[i, j] < 0:
                rows[idx] = i
                cols[idx] = j
                idx += 1
    return rows[:idx], cols[:idx]

Timing this, I get:

%timeit numba_only(A)
2.29 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is faster than np.where(A < 0) without numba, which gives:

%timeit numpy_only(A)
3.56 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

But it is a little slower than np.where wrapped by @nb.njit:

@nb.njit(cache=True)
def numba_where(arr):
    return np.where(arr<0)

which gives:

%timeit numba_where(A)
1.76 ms ± 84.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Can the numba code be sped up by parallelization somehow? I realise it might be memory bound but I think that modern hardware should allow some level of parallel access to memory.

I am not sure how to use nb.prange to achieve this due to the index idx in the loop.


Solution

  • The provided algorithm is inherently sequential due to its long, data-dependent chain of instructions: each write position depends on how many matches were found before it. However, it is possible to perform an equivalent operation in parallel using a scan.

    Implementing such a scan-based parallel algorithm is significantly more complex (especially an efficient, cache-friendly scan). A relatively simple implementation iterates over the dataset twice: first compute per-chunk counts, then compute their cumulative sum to get the write offset of each chunk, and finally do the actual work.

    @nb.njit(cache=True, parallel=True)
    def numba_parallel_where(arr):
        flattenArr = arr.reshape(-1)
        arrSize = flattenArr.size
        chunkSize = 2048
        chunkCount = (arrSize + chunkSize - 1) // chunkSize
        # First parallel pass: count the matches of each chunk
        counts = np.empty(chunkCount, dtype=np.int32)
        for chunkId in nb.prange(chunkCount):
            start = chunkId * chunkSize
            end = min(start + chunkSize, arrSize)
            count = 0
            for i in range(start, end):
                count += flattenArr[i] < 0
            counts[chunkId] = count
        # Sequential scan: the cumulative sum of the counts gives the write offset of each chunk
        offsets = np.empty(chunkCount + 1, dtype=np.int64)
        offsets[0] = 0
        for chunkId in range(chunkCount):
            offsets[chunkId + 1] = offsets[chunkId] + counts[chunkId]
        outSize = offsets[-1]
        rows = np.empty(outSize, dtype=np.int32)
        cols = np.empty(outSize, dtype=np.int32)
        n = np.int32(arr.shape[1])
        # Second parallel pass: each chunk writes its matches starting at its own offset
        for chunkId in nb.prange(chunkCount):
            start = chunkId * chunkSize
            end = min(start + chunkSize, arrSize)
            idx = offsets[chunkId]
            for i in range(start, end):
                if flattenArr[i] < 0:
                    rows[idx] = i // n
                    cols[idx] = i % n
                    idx += 1
        return rows, cols
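
    A quick way to sanity-check the result against np.where (the chunked version returns int32 indices while np.where returns int64, but the values and their order match since both scan the array in row-major order):

    r, c = numba_parallel_where(A)
    r_ref, c_ref = np.where(A < 0)
    assert np.array_equal(r, r_ref) and np.array_equal(c, c_ref)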
    

    Performance Results

    Here are performance results on my 6-core i5-9600KF CPU with 40 GiB/s RAM (optimal bandwidth for reads), on Windows:

    Sequential:                         1.68 ms
    Parallel (no pre-allocations):      0.85 ms
    Parallel (with pre-allocations):    0.55 ms   <-----
    Theoretical optimal:                0.20 ms
    

    The parallel implementation is about twice as fast. It reaches a memory throughput of ~19 GiB/s on my machine, which is sub-optimal but not too bad. While this may seem disappointing on a 6-core CPU, the parallel implementation is not very optimized, and there are several points to consider:

    • allocating the output arrays is really slow (it seems to take 0.2-0.3 ms)
    • the input array is read twice, putting a lot of pressure on memory
    • the modulo/division operations used to compute the indices are a bit expensive

    Pre-allocating the output arrays (filled manually with zeros to avoid possible page faults) and passing them to the function significantly reduces the execution time. This optimized version is 3 times faster than the initial (sequential) one. It reaches ~28 GiB/s, which is relatively good.
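
    Here is a minimal sketch of what such a pre-allocated variant could look like (the name numba_parallel_where_prealloc and the exact way the buffers are filled are only illustrative, not necessarily the code timed above):

    @nb.njit(cache=True, parallel=True)
    def numba_parallel_where_prealloc(arr, rows, cols):
        # Same two-pass chunked scan as above, but the matches are written
        # into caller-provided buffers; only the number of matches is returned.
        flattenArr = arr.reshape(-1)
        arrSize = flattenArr.size
        chunkSize = 2048
        chunkCount = (arrSize + chunkSize - 1) // chunkSize
        counts = np.empty(chunkCount, dtype=np.int32)
        for chunkId in nb.prange(chunkCount):
            start = chunkId * chunkSize
            end = min(start + chunkSize, arrSize)
            count = 0
            for i in range(start, end):
                count += flattenArr[i] < 0
            counts[chunkId] = count
        offsets = np.empty(chunkCount + 1, dtype=np.int64)
        offsets[0] = 0
        for chunkId in range(chunkCount):
            offsets[chunkId + 1] = offsets[chunkId] + counts[chunkId]
        n = np.int32(arr.shape[1])
        for chunkId in nb.prange(chunkCount):
            start = chunkId * chunkSize
            end = min(start + chunkSize, arrSize)
            idx = offsets[chunkId]
            for i in range(start, end):
                if flattenArr[i] < 0:
                    rows[idx] = i // n
                    cols[idx] = i % n
                    idx += 1
        return offsets[-1]

    # The buffers are allocated and zero-filled once, outside the timed call,
    # so the pages are already mapped when the function runs.
    rows = np.empty(A.size, dtype=np.int32)
    cols = np.empty(A.size, dtype=np.int32)
    rows[:] = 0
    cols[:] = 0
    count = numba_parallel_where_prealloc(A, rows, cols)
    r, c = rows[:count], cols[:count]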


    Notes about Performance

    Other possible optimizations include:

    • using clusters of chunks so as to avoid iterating twice over the whole dataset (writing a fast implementation of this in Numba is hard, since parallelism is much more limited than in native languages);
    • iterating over the rows of the matrix to avoid the modulos/divisions, assuming the number of rows is neither too big nor too small for performance (see the sketch after this list).
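
    For the second point, a minimal, untimed sketch of the idea (one chunk per row, so i and j are known directly and no modulo/division is needed) could look like this:

    @nb.njit(cache=True, parallel=True)
    def numba_parallel_where_rows(arr):
        m, n = arr.shape
        # First parallel pass: count the matches of each row
        counts = np.empty(m, dtype=np.int64)
        for i in nb.prange(m):
            count = 0
            for j in range(n):
                count += arr[i, j] < 0
            counts[i] = count
        # Cumulative sum: write offset of each row
        offsets = np.empty(m + 1, dtype=np.int64)
        offsets[0] = 0
        for i in range(m):
            offsets[i + 1] = offsets[i] + counts[i]
        rows = np.empty(offsets[-1], dtype=np.int32)
        cols = np.empty(offsets[-1], dtype=np.int32)
        # Second parallel pass: each row writes i and j directly
        for i in nb.prange(m):
            idx = offsets[i]
            for j in range(n):
                if arr[i, j] < 0:
                    rows[idx] = i
                    cols[idx] = j
                    idx += 1
        return rows, cols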

    AFAIK, there has been an open bug report about slow Numba allocations for a few years, but it has obviously not been solved yet. On top of that, relatively large allocations can be quite expensive on Windows. Natively-compiled code (written in C/C++/Fortran/Rust) on Linux should not exhibit such an issue.

    Note that such optimizations will make the implementation even more complex than the above one.