Tags: python, numpy, performance, numba

Compiled Numba function not faster than CPython


I have a function compiled with Numba that splits an array based on an index array; this returns an irregular (variable-length) list of NumPy arrays, which is then padded to form a 2D array.

Problem

The compiled function 'nb_array2mat' should be much faster than the pure Python 'array2mat', but it is not.

Additionally, is this possible using numpy?

length of the array and index    
1456391 95007

times:
numba:  1.3438396453857422    
python:  1.1407015323638916

I think I am not using Numba's compilation properly. Any help would be great.

EDIT

Using dummy data, as edited in the code section, I now get a speed-up. Why does it not work with the actual data?

 length of the array and index    
 1456391 95007
    
 times:
 numba:  0.012002706527709961
 python:  0.13403034210205078

Code

idx_split: https://drive.google.com/file/d/1hSduTs1_s3seEFAiyk_n5yk36ZBl0AXW/view?usp=sharing

dist_min_orto: https://drive.google.com/file/d/1fwarVmBa0NGbWPifBEezTzjEZSrHncSN/view?usp=sharing

import time
import numba
import numpy as np
from numba.pycc import CC

cc = CC('compile_func')
cc.verbose = True  
   
@numba.njit(parallel=True, fastmath=True)
@cc.export('nb_array2mat', 'f8[:,:](f8[:], i4[:])')
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in numba.prange(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2D array from arr values, padding empty cells with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in numba.prange(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat


if __name__ == "__main__":
    cc.compile()

    
# PYTHON FUNC
def array2mat(arr, idx):
    # split arr by idx indexes
    out = []
    s = 0
    for n in range(len(idx)):
        e = idx[n]
        out.append(arr[s:e])
        s = e
    # create a 2D array from arr values, padding empty cells with fill_value=1000000.0
    _len = [len(_i) for _i in out]
    cols = max(_len)
    rows = len(out)
    mat = np.full(shape=(rows, cols), fill_value=1000000.0)
    for row in range(rows):
        len_col = len(out[row])
        mat[row, :len_col] = out[row]
    return mat
    
import compile_func  
#ACTUAL DATA
arr = np.load('dist_min_orto.npy').astype(float)
idx = np.load('idx_split.npy').astype(int)

# DUMMY DATA
arr = np.random.randint(50, size=1456391).astype(float)
idx = np.cumsum(np.random.randint(5, size=95007).astype(int))
print(len(arr), len(idx))



#NUMBA FUNC
t0 = time.time()
print(compile_func.nb_array2mat(arr, idx))
print(time.time() - t0)

# PYTHON FUNC
t0 = time.time()
print(array2mat(arr, idx))
print(time.time() - t0)

Solution

  • You cannot use nb.prange on the first loop since out is shared between threads and is also read and written by them. This causes a race condition. Numba assumes that there are no dependencies between iterations, and it is your responsibility to guarantee this. The simplest solution is not to use a parallel loop here; the sketch below shows the pattern that is safe.
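
    As a minimal illustration (not the final solution; the function name is made up for the example), a prange loop is safe when each iteration writes only to a slot it owns:

    import numba
    import numpy as np

    # Unsafe pattern: out.append(...) mutates a list shared across threads.
    # Safe pattern: each iteration writes only to out[n], its own slot,
    # so iterations are independent.
    @numba.njit(parallel=True)
    def lengths_from_ends(idx):
        out = np.empty(len(idx), dtype=np.int64)
        for n in numba.prange(len(idx)):
            s = idx[n - 1] if n >= 1 else 0
            out[n] = idx[n] - s
        return out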

    Additionally, the second loop is mainly memory-bound, so I do not expect a big speed-up from using multiple threads: the RAM is a shared resource with limited throughput (a few threads are often enough to saturate it, especially on PCs, where one thread is sometimes enough).
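
    A minimal sketch to test this memory-bound claim (the function and sizes are illustrative): if the fill loop saturates the RAM bandwidth, increasing the thread count beyond a few should not reduce the time.

    import time
    import numba
    import numpy as np

    @numba.njit(parallel=True)
    def fill(mat):
        # each row is written by exactly one iteration: no race condition
        for row in numba.prange(mat.shape[0]):
            mat[row, :] = 1000000.0

    mat = np.empty((95007, 5469))
    fill(mat)  # warm-up call so compilation time is not measured
    for n in (1, 2, 4, 8):
        numba.set_num_threads(n)
        t0 = time.time()
        fill(mat)
        print(n, 'threads:', time.time() - t0)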

    Fortunately, you do not need to create the out temporary list: only the end offsets are needed, so that len_col can be computed inside the parallel loop. The maximum number of columns can be computed on the fly in the first loop, which should execute very quickly compared to the second one.

    Filling a big, newly allocated matrix is often faster in parallel on Linux since page faults can be served in parallel. AFAIK, on Windows this is less true (likely because page faults scale worse there). Parallelism also helps here because the range 0:len_col is variable, so the time to fill each row varies, causing some threads to finish after others (the slowest thread bounds the execution time). Furthermore, this is generally much faster on NUMA machines since each NUMA node can write to its own memory.

    Note that AOT compilation does not support automatic parallel execution. To quote a Numba developer:

    From discussion in today's triage meeting, related to #7696: this is not likely to be supported as AOT code doesn't require Numba to be installed - this would mean a great deal of work and issues to overcome for packaging the code for the threading layers.

    The same applies to fastmath, though it is likely to be added in an upcoming release given the current work.

    Note that JIT compilation and AOT compilation are two separate processes. Thus, the parameters of njit are not shared with cc.export, and the signature passed to cc.export is not shared with njit. This means the JIT function would only be compiled during its first execution, due to lazy compilation. That being said, the function is then redefined in your script, so the njit decorator is useless here (its definition is overwritten).
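
    To make the lazy/eager distinction concrete, here is a minimal sketch (the function names are made up for the example):

    import numba

    # Lazy: no signature, so compilation happens at the first call
    # (the first call therefore pays the compilation cost).
    @numba.njit
    def f_lazy(x):
        return x + 1

    # Eager: an explicit signature triggers compilation at definition time.
    @numba.njit('i8(i8)')
    def f_eager(x):
        return x + 1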

    Here is the resulting code (using only the JIT implementation, with eager compilation instead of the AOT one):

    import time
    import numba
    import numpy as np
    
    @numba.njit('f8[:,:](f8[:], i4[:])', fastmath=True)
    def nb_array2mat(arr, idx):
        # split arr by idx indexes
        s = 0
        ends = np.empty(len(idx), dtype=np.int_)
        cols = 0
        for n in range(len(idx)):
            e = idx[n]
            ends[n] = e
            len_col = e - s
            cols = max(cols, len_col)
            s = e
        # create a 2D array from arr values, padding empty cells with fill_value=1000000.0
        rows = len(idx)
        mat = np.empty(shape=(rows, cols))
        for row in numba.prange(rows):
            s = ends[row-1] if row >= 1 else 0
            e = ends[row]
            len_col = e - s
            mat[row, 0:len_col] = arr[s:e]
            mat[row, len_col:cols] = 1000000.0
        return mat
    
    # PYTHON FUNC
    def array2mat(arr, idx):
        # split arr by idx indexes
        out = []
        s = 0
        for n in range(len(idx)):
            e = idx[n]
            out.append(arr[s:e])
            s = e
        # create a 2D array from arr values, padding empty cells with fill_value=1000000.0
        _len = [len(_i) for _i in out]
        cols = max(_len)
        rows = len(out)
        mat = np.full(shape=(rows, cols), fill_value=1000000.0)
        for row in range(rows):
            len_col = len(out[row])
            mat[row, :len_col] = out[row]
        return mat
    
    #ACTUAL DATA
    arr = np.load('dist_min_orto.npy').astype(np.float64)
    idx = np.load('idx_split.npy').astype(np.int32)
    
    #NUMBA FUNC
    t0 = time.time()
    print(nb_array2mat(arr, idx))
    print(time.time() - t0)
    
    # PYTHON FUNC
    t0 = time.time()
    print(array2mat(arr, idx))
    print(time.time() - t0)
    

    On my machine, the new Numba code is slightly faster: it takes 0.358 seconds for the Numba implementation and 0.418 seconds for the Python implementation. In fact, a purely sequential Numba version is even slightly faster on my machine, taking 0.344 seconds.

    Note that the shape of the output matrix is (95007, 5469), so the matrix takes 3.87 GiB in memory. You should check that you have enough memory to store it. In fact, the Python implementation takes about 7.5 GiB on my machine (possibly because the GC/default allocator does not release the memory immediately). If you do not have enough memory, the system can fall back to the very slow swap memory (which uses your storage device).

    Moreover, x86-64 processors use a write-allocate cache policy, causing written cache lines to actually be read by default. Non-temporal writes can be used to avoid this on a big matrix; unfortunately, neither NumPy nor Numba uses them on my machine, so half of the RAM throughput is wasted. Not to mention that page faults are pretty expensive: in sequential code, 60% of the time of the NumPy implementation is spent in page faults. The Numba code spends almost all its time writing to memory and performing page faults. Here is a related open issue.
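
    As for the question's "is this possible using numpy?": here is a hedged sketch of a pure-NumPy version (not part of the original answer; np_array2mat is a made-up name, and it assumes idx holds non-decreasing end offsets, as in the code above). It builds a boolean mask of the valid cells and scatters arr into the matrix in row-major order:

    import numpy as np

    def np_array2mat(arr, idx):
        ends = np.asarray(idx)
        starts = np.concatenate(([0], ends[:-1]))
        lens = ends - starts                    # row lengths
        cols = lens.max()
        mat = np.full((len(ends), cols), 1000000.0)
        # mask[i, j] is True iff column j is a valid cell of row i
        mask = np.arange(cols) < lens[:, None]
        # boolean assignment fills the selected cells in row-major order,
        # matching the concatenation order of the slices of arr
        mat[mask] = arr[:ends[-1]]
        return mat

    Keep in mind that the (rows, cols) boolean mask adds roughly 0.5 GiB of temporary memory for this input, and the approach remains bound by the same RAM throughput and page-fault costs discussed above.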