Tags: numpy, performance, parallel-processing, multiprocessing, numba

Faster/parallelized way to merge multiple Numpy 3d arrays into one existing 3d array


I have a list containing 100 Numpy arrays/images with the shape (1024, 1024, 4). I have a second already existing Numpy array with the shape (1024, 102400, 4).

I try to merge them into an already existing array like so:

import numpy as np

imgs = [np.random.randint(0, 255, (1024, 1024, 4), np.uint8) for _ in range(100)]
imgs_arr = np.empty((1024, 1024 * 100, 4), np.uint8)
for i in range(100):
    imgs_arr[:, i * 1024:(i + 1) * 1024] = imgs[i]

Takes 0.1025 seconds.

The imgs_arr will only be created once at startup and then reused over and over again.

Is there any way to increase the performance of the loop or parallelize the loop to fill the array with the list of images using either Numba, multiprocessing, threading or just Numpy itself?

A potential setup process once at startup would also be fine, e.g. setting up workers for multiprocessing, but I'm looking for better performance once the setup is done. However, I'm not sure how to approach this problem and would be thankful for any help.

I tried the following with Numba, but without success. It also comes with other problems as described below.

import numpy as np

from numba import carray, njit, prange
from numba.extending import intrinsic
from numba.typed import List

@intrinsic
def address_to_void_pointer(typingctx, src):
    from numba.core import types, cgutils
    sig = types.voidptr(src)

    def codegen(cgctx, builder, sig, args):
        return builder.inttoptr(args[0], cgutils.voidptr_t)

    return sig, codegen


@njit(cache=True, parallel=True)
def numba_parallelization_func(imgs_arr_addr, imgs, shape,
                               dtype, img_count):
    # Rebuild a writable array view of imgs_arr from its raw address
    imgs_arr = carray(address_to_void_pointer(imgs_arr_addr), shape, dtype)
    # Copy each image into its 1024-column slot, one image per parallel iteration
    for i in prange(img_count):
        imgs_arr[:, i * 1024:(i + 1) * 1024] = imgs.getitem_unchecked(i)

    return None


def numba_parallelization() -> None:
    imgs = List([np.random.randint(0, 255, (1024, 1024, 4), np.uint8) for _ in range(100)])
    imgs_arr = np.empty((1024, 1024 * 100, 4), np.uint8)
    imgs_arr_addr = imgs_arr.ctypes.data
    numba_parallelization_func(imgs_arr_addr, imgs, imgs_arr.shape,
                               imgs_arr.dtype, len(imgs))

    return None

numba_parallelization_func takes 0.0975 seconds after compilation/cache lookup.

Not only is this attempt barely faster, but it also needs a typed list. The list of arrays, however, is filled from a queue using append. And while the following is fast...

import numpy as np

def lst_append():
    arrs = [np.random.randint(0, 255, (1024, 1024, 4), np.uint8) for _ in range(100)]
    lst = []
    for arr in arrs:
        lst.append(arr)

    return None

The loop takes 5.4999e-06 seconds.

...the typed list, even if pre-allocated, is so slow that it renders any performance gains from Numba useless:

import numpy as np

from numba.typed import List

def typed_lst_append():
    arrs = [np.random.randint(0, 255, (1024, 1024, 4), np.uint8) for _ in range(100)]
    typed_lst = List([np.random.randint(0, 255, (1024, 1024, 4), np.uint8) for _ in range(100)])
    for i in range(100):
        typed_lst[i] = arrs[i]

    return None

The loop takes 0.2541 seconds.

Edit: It is important to note that this problem isn't limited to 100 arrays or a shape of (1024, 1024, 4); this is meant as an example. The imgs list is assembled from tens of thousands of images via an image cache. While for most requests the majority of images are already available in the cache, there are usually at least some that need to be loaded. I can't hold all images in memory.


Solution

  • TL;DR: memory accesses are sub-optimal and you can use streaming stores to make this faster (1.5x). Alternatively, you can try to make the overall code more cache-friendly, assuming this is actually possible in your specific case (probably not, based on the comments).


    Takes 0.1025 seconds.

    On my system it takes 42.8 ms.

    The operation is memory bound. Indeed, the reading throughput of my RAM is 19.2 GiB/s while the writing throughput is 9.3 GiB/s, i.e. a combined throughput of 28.5 GiB/s. To understand how good this is: a raw memory copy in Numpy reaches 17.6 + 17.6 = 35.2 GiB/s on the same machine, and my RAM bandwidth is theoretically about 42 GiB/s. This means the code provided in the question already saturates roughly 80% of my RAM bandwidth. Compute-related optimizations are useless in this case.
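    (As a side note, such raw-copy throughput figures can be estimated with a small micro-benchmark. Below is a minimal C sketch using the same 400 MiB footprint as imgs_arr; the exact numbers of course depend on the machine and are not the ones quoted above.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t size = (size_t)1024 * 102400 * 4;   // same footprint as imgs_arr (400 MiB)
        unsigned char *src = malloc(size);
        unsigned char *dst = malloc(size);
        memset(src, 1, size);                            // touch the pages so they are committed
        memset(dst, 0, size);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, size);                          // raw copy: size bytes read + size bytes written
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("copy: %.1f ms, %.1f GiB/s combined\n",
               dt * 1e3, 2.0 * size / dt / (1024.0 * 1024.0 * 1024.0));

        free(src);
        free(dst);
        return 0;
    }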

    numba_parallelization_func takes 0.0975 seconds after compilation/cache lookup.

    On my system it takes 36.9 ms.

    Using more cores can sometimes help to mitigate latency issues on one core. Moreover, on servers, it is not rare for a single core to be unable to saturate the memory (by design). This is certainly why you get a small speed-up with Numba using more cores, but as long as the same approach is used it is not possible to get more than about a 25% speed-up on my machine due to the RAM saturation. In practice the speed-up is 16% here. The RAM throughput is 22.4 + 10.7 = 33.1 GiB/s. This is 94% of the (nearly optimal) copy throughput, which is pretty good. That being said, I do not recommend using the Numba code since it is unsafe and uses a lot of cores (6 on my CPU) for a very small speed improvement.

    the typed list, even if pre-allocated, is so slow that it renders any performance gains from Numba useless

    Yes. This is AFAIK a known problem of Numba typed lists.

    Is there any way to increase the performance of the loop or parallelize the loop to fill the array with the list of images using either Numba, multiprocessing, threading or just Numpy itself?

    There is one. Indeed, based on the size of imgs_arr, we should expect a reading throughput of imgs_arr.size/42.8e-3/1024**3 = 9.1 GiB/s for the Numpy code, not 19.2 GiB/s. The same should hold for the writing throughput, and it does (9.3 vs 9.1 GiB/s). The reading throughput is thus significantly higher than it should be!

    This is because of the write-allocate cache policy of mainstream CPUs: when data is written to the caches, the target cache line is first fetched from DRAM before being written back once it has been filled. The CPU does this because it has no idea whether the whole cache line will actually be overwritten, and the data may not be aligned either (so in practice the fetch cannot always be avoided).

    The only way I am aware of to solve this problem is to use streaming stores (a.k.a. non-temporal stores of the SIMD units) on x86-64 CPUs. They should be used by optimized functions like memcpy in C/C++ as long as the target block of memory is sufficiently large. Indeed, streaming stores tend to bypass the caches and write directly to RAM (it is a bit more complex in practice, but let's keep it simple for the sake of clarity), which means it is not a good idea to use them if the block of memory fits in cache and is reused later. The bad news is that only the developer can know that (and sometimes it is not even easy to know), not to mention that this assumes the developer targets one specific cache size, which is rare nowadays. Thus, using streaming stores can be a good idea on some CPUs with a limited cache size and a bad one on a CPU with a very large cache (assuming the data is reused later, which is expected). That being said, imgs_arr takes 400 MiB here and there are currently no mainstream CPUs with caches that large. This means streaming stores should help. With them the code should be up to 50% faster (optimal for the target operation).

    In your case, both Numpy and Numba should not use streaming stores because the copied chunks are too small and they do not know whether the data will be reused or not. Besides, Numba seems not to use them on some machines even when it should. Because of that, I do not think it is possible to do this in Numpy. I also do not think there is any simple way to do it in Numba (one needs to use SIMD streaming-store instructions directly, which is neither simple nor portable, not to mention they only work on aligned data). This is why I requested adding this feature in Numba.

    AFAIK, the best solution is to write native low-level code (e.g. in C/C++) for that. One way is to manually use the following x86-64-specific intrinsics:

    void _mm_stream_si128(__m128i* mem_addr, __m128i a);     // movntdq
    void _mm256_stream_si256(__m256i* mem_addr, __m256i a);  // vmovntdq
    void _mm512_stream_si512(void* mem_addr, __m512i a);     // vmovntdq
    

    Note that mem_addr must be aligned (on a 16-byte boundary for the 128-bit one, a 32-byte boundary for the 256-bit one and a 64-byte boundary for the 512-bit one). Also note that only few CPUs support the AVX-512 SIMD instruction set (even some recent Intel CPUs do not), while AVX is supported by more than ~95% of x86-64 CPUs nowadays. Thus, I advise you to use the 256-bit one. If you care about compatibility, then you should use the 128-bit one (it runs on all x86-64 CPUs). Note that this solution requires writing code specific to each CPU ISA (e.g. x86/x86-64, ARM, POWER, etc.), though not supporting other platforms might not be a problem if you do not plan to run your code on ARM CPUs or others.
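    To illustrate, here is a minimal sketch of such a copy using the 256-bit intrinsic. The helper name stream_copy_256 is hypothetical; size is assumed to be a multiple of 32 bytes and dst is assumed to be 32-byte aligned (which is the caller's responsibility). It needs AVX enabled at compile time (e.g. -mavx with GCC/Clang):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // Copy `size` bytes (a multiple of 32) from `src` to `dst` with 256-bit non-temporal stores.
    void stream_copy_256(uint8_t * __restrict dst, const uint8_t * __restrict src, size_t size)
    {
        for (size_t i = 0; i < size; i += 32)
        {
            // Regular (possibly unaligned) load of a 32-byte chunk from the source
            __m256i chunk = _mm256_loadu_si256((const __m256i *)(src + i));
            // Non-temporal store: the data bypasses the cache hierarchy on the write side
            _mm256_stream_si256((__m256i *)(dst + i), chunk);
        }
        // Make the non-temporal stores globally visible before any later load
        _mm_sfence();
    }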

    An alternative solution is to use the nontemporal clause of OpenMP SIMD directives, though it might not be supported by the target compiler; you should check the generated assembly code to be sure. Unfortunately, it seems the three mainstream compilers GCC, Clang and MSVC generate inefficient code in this case. ICC (the old Intel compiler) seems to generate good code, and it is not clear for ICX (the new Intel compiler).
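    For reference, using this clause might look like the following sketch (OpenMP 5.0; the helper name copy_nt is mine). Whether non-temporal stores are actually emitted must be checked in the generated assembly (look for (v)movnt* instructions):

    #include <stddef.h>
    #include <stdint.h>

    void copy_nt(uint8_t * __restrict dst, const uint8_t * __restrict src, size_t size)
    {
        // Hint to the compiler that the stores to `dst` are non-temporal
        #pragma omp simd nontemporal(dst)
        for (size_t i = 0; i < size; ++i)
            dst[i] = src[i];
    }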

    Another similar solution is to use a compiler-specific preprocessor directive when one is actually available, although it is not portable from one compiler to another. AFAIK, Intel compilers have such a directive, but the OpenMP solution, which is meant to be portable, already seems to be good when compiled with the Intel compilers anyway.

    Thus, as of now, I recommend writing C/C++ code doing this manually if you really care a lot about performance. An alternative solution is to rewrite your code so as to make it more cache-friendly (i.e. avoid creating large arrays like this in the first place). That is significantly better than using streaming stores, but not always possible.
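    For completeness, here is a rough sketch of what such a native routine could look like, reusing the hypothetical stream_copy_256 helper from above. The function name, the pointer-array interface and the assumption that imgs_arr is 32-byte aligned are mine, not a fixed recipe; such a function could be compiled into a shared library and called from Python (e.g. through ctypes or cffi):

    #include <stddef.h>
    #include <stdint.h>

    void stream_copy_256(uint8_t * __restrict dst, const uint8_t * __restrict src, size_t size); // from the sketch above

    // Merge `img_count` images of shape (height, width, 4) into the big
    // (height, width * img_count, 4) array, one image row at a time with streaming stores.
    void merge_images(uint8_t * __restrict imgs_arr,
                      const uint8_t * const *imgs,    // img_count pointers to (height, width, 4) images
                      size_t height, size_t width, size_t img_count)
    {
        const size_t row_bytes  = width * 4;               // bytes of one image row
        const size_t dst_stride = width * img_count * 4;   // bytes of one imgs_arr row

        for (size_t i = 0; i < img_count; ++i)
            for (size_t y = 0; y < height; ++y)
                stream_copy_256(imgs_arr + y * dst_stride + i * row_bytes,
                                imgs[i] + y * row_bytes,
                                row_bytes);
    }

    For the (1024, 1024, 4) images of the question, row_bytes is 4096 and dst_stride is 409600, both multiples of 32, so the destination stays 32-byte aligned as long as imgs_arr itself is. The per-row fence inherited from stream_copy_256 is a simplification; hoisting a single _mm_sfence() out of the loops is a straightforward refinement.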