Tags: python, numpy, dynamic-memory-allocation, numba, calloc

Why is np.zeros() faster than re-initializing an existing array in Numba with Python?



I work with computer modeling and use Numba for my work. Sometimes I need a zeroed array to accumulate the results of some operation. In general, I assumed that zeroing an already allocated array cannot be slower than creating a new array filled with zeros, but that is not what I observe. I know about lazy allocation (for example Why is the speed difference between Python's Numpy zeros and empty functions gone for larger array sizes? and https://vorpus.org/blog/why-does-calloc-exist/), but the zeroing must still take time at some point.

As far as I know, np.zeros uses calloc, and all the acceleration comes from this call, so it should be reproducible in other languages. Is there any guarantee that this is always the case? Is relying on it good practice or not?

import numpy as np
import numba as nb
import benchit
nb.set_num_threads(1)

@nb.njit
def numba_operation(in_arr, out):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i,j] += in_arr[i,j] * 2 + 4
            
@nb.njit
def numba_operation_with_zeros(in_arr, out):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i,j] = 0
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i,j] += in_arr[i,j] * 2 + 4

    
def every_time_generate_zeros(data):
    in_arr, out = data
    out = np.zeros(shape=(out.shape[0], out.shape[1]))
    numba_operation(in_arr, out)
    return out

def make_zeros_numba(data):
    in_arr, out = data
    numba_operation_with_zeros(in_arr, out)

def generate_arrays(n):
    in_arr = np.random.rand(2**n, 2**n)
    out = np.random.rand(2**n, 2**n)
    return in_arr, out

t = benchit.timings([every_time_generate_zeros, make_zeros_numba], 
                    {n:generate_arrays(n) for n in np.arange(9, 15, 1)}, 
                    input_name='2^n')
t.plot(modules=benchit.extract_modules_from_globals(globals()))

Results:

[benchmark plot: runtimes of every_time_generate_zeros and make_zeros_numba for 2^n, n = 9..14]


Solution

  • TL;DR: the observed behaviour is due to a combination of several low-level effects related to CPU caches and virtual memory.


    For large arrays, np.zeros does not actually fill anything in physical memory on mainstream platforms. In this case, the C library function calloc is used internally to request a zeroed memory region from the operating system (OS). This memory is allocated virtually, not physically. Virtual memory is split into small chunks called pages, and on mainstream OSes allocated pages are only backed by physical memory when they are first touched. Note that memory freshly obtained from the OS through malloc (called by np.empty) is also zeroed, for security reasons (no information should leak from one application to another).
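
    A quick way to see this lazy behaviour is to time the allocation separately from the first write. This is only a minimal sketch; the array size is arbitrary and the exact timings are machine-dependent:

    import time
    import numpy as np

    n = 2**12                          # 4096 x 4096 doubles (128 MiB), large enough for lazy allocation to matter
    t0 = time.perf_counter()
    a = np.zeros((n, n))               # virtual allocation only: near-instant
    t1 = time.perf_counter()
    a += 1.0                           # first touch: pages are faulted in and zeroed now
    t2 = time.perf_counter()
    print(f"np.zeros: {t1 - t0:.6f} s, first touch: {t2 - t1:.6f} s")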

    What this means is that np.zeros is cheap for large arrays (because of the lazy/deferred initialization) compared to manually filling arrays with zeros (which is much slower on mainstream OSes). If you write into a newly allocated array, as in every_time_generate_zeros, the pages do need to be zeroed, but this zeroing is performed on the fly, page by page. This is a huge difference from the make_zeros_numba implementation, which first zeroes the whole array and then fills it again with non-zero values! Indeed, classical pages are typically a few KiB wide (4 KiB on mainstream x86-64 platforms), so each one fits in the L1 CPU cache. When every_time_generate_zeros writes a value into a virtually allocated page not yet filled with zeros, a page fault is triggered and the whole page is filled with zeros at that point. The freshly zeroed page is then hot in the cache, so writing into it is much faster. This is why make_zeros_numba is slower in your case: the array has to be stored to DRAM twice, because it likely does not fit in the (same) CPU cache (at least not for 2^n >= 2^12, i.e. n >= 12).
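
    On Unix-like systems you can even observe the first-touch page faults directly, e.g. with the resource module. This is a rough sketch: ru_minflt counts minor page faults of the whole process, so other activity adds some noise.

    import resource
    import numpy as np

    def minor_faults():
        # Minor (soft) page faults of the current process (Unix-only).
        return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

    before = minor_faults()
    a = np.zeros((2**12, 2**12))   # virtual allocation: very few faults so far
    after_alloc = minor_faults()
    a[:] = 1.0                     # first touch: roughly one fault per page
    after_touch = minor_faults()
    print("faults during allocation:", after_alloc - before)
    print("faults during first touch:", after_touch - after_alloc)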

    What happens under the hood is pretty complex. In fact, there are a few omitted details that make it even more complex, but I tried to keep the explanation relatively simple so that it remains easy to understand.


    If you want something fast, then you need to virtually split the array into chunks (i.e. tiles) that are filled/computed on the fly, and also avoid creating temporary arrays. However, this is hard to do in non-trivial code (in fact, it is not always possible). It is critical for performance because of the Memory Wall.
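
    For the specific kernels in the question, though, the simplest improvement is to skip the separate zeroing pass and write the final value directly, so the output array is only streamed once. This is a sketch; it is equivalent to the original code because out starts at zero:

    import numba as nb

    @nb.njit
    def numba_operation_fused(in_arr, out):
        # out[i, j] = 0 followed by out[i, j] += x is just out[i, j] = x,
        # so a single pass over the output is enough.
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = in_arr[i, j] * 2 + 4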


    Additional notes and explanation

    Note that page faults are also expensive. In fact, they can be more expensive than reusing the same array on some systems (typically servers with a large DRAM bandwidth). As a result, there are machines on which make_zeros_numba can actually be faster! The behaviour also depends on the operating system and on the implementation of the standard C library.
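
    In other words, whether reusing a buffer beats reallocating has to be measured on the target machine. A minimal sketch of the two pure-NumPy strategies (the function names are mine, purely for illustration) could look like this:

    import numpy as np

    def reallocate_each_time(in_arr, n_iter=10):
        for _ in range(n_iter):
            out = np.zeros(in_arr.shape)   # lazy calloc: page faults are paid at first touch, every iteration
            out += in_arr * 2 + 4
        return out

    def reuse_buffer(in_arr, n_iter=10):
        out = np.empty(in_arr.shape)
        for _ in range(n_iter):
            out[:] = 0.0                   # explicit DRAM writes, but no page faults after the first iteration
            out += in_arr * 2 + 4
        return out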

    Using multiple threads to fill the target arrays also often impacts the performance of the two approaches differently. Indeed, page faults barely scale on some systems (e.g. Windows) while they scale well on others (e.g. Linux). In general, DRAM writes do not scale with the number of cores: on most machines, a few cores are enough to saturate the DRAM bandwidth.
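
    As a sketch, the scaling of the filling part itself can be checked with a parallel Numba kernel. How much it helps depends on the DRAM bandwidth and, for the page-fault part, on the OS:

    import numba as nb
    import numpy as np

    @nb.njit(parallel=True)
    def parallel_zero(out):
        # Rows are zeroed by different threads; on Linux this also spreads
        # the first-touch page faults across the threads.
        for i in nb.prange(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = 0.0

    out = np.empty((2**12, 2**12))
    parallel_zero(out)   # compare nb.set_num_threads(1) with more threads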

    I deliberately did not mention an important factor when it comes to filling memory with zeros. Modern x86-64 processors use write-back CPU caches. This means data has to be read from DRAM before a cache line can be written (possibly many times). The modified cache lines are then written back to DRAM later (typically during cache misses). Reading from DRAM just to write zeros is inefficient (half the bandwidth is wasted). This is why modern x86-64 processors also have dedicated instructions to avoid this problem: non-temporal stores (NT stores). The memcpy (and possibly memset) library functions generally use them when appropriate. NT stores are only worth it for large arrays that do not fit in the CPU caches, or for arrays that are never directly re-used afterwards (the latter is very hard to know in practice). Indeed, small arrays tend to fit in the CPU caches, so they do not need to be written to DRAM (much slower than the caches) over and over again. This is why small arrays can behave very differently from large ones. Recent x86-64 processors even have special instructions to fill memory with zeros faster than with the usual instructions.

    Note that there are also huge pages, which are much bigger than classical pages (e.g. 2 MiB), precisely in order to reduce the overhead of the small classical pages (especially page faults). Using them can strongly impact performance, since the L1 cache is generally not large enough to hold one huge page. In fact, this is often true for the L2 as well (if any). The LLC tends to be large enough, but it is also significantly slower than the L1/L2. Besides, huge pages can be used automatically by the OS.

    Finally, note that the Numba JIT can be clever enough to replace the zeroing loop with a memset, which can be significantly faster thanks to NT stores. However, it turns out that this is platform-dependent so far.
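
    If you want to check this on your own machine, compare the explicit zeroing loop with a slice assignment, which Numba can lower to an optimized fill. This is a sketch; whether the memset/NT-store path is actually taken, and whether it wins, is platform-dependent:

    import numba as nb
    import numpy as np

    @nb.njit
    def zero_with_loop(out):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = 0.0

    @nb.njit
    def zero_with_slice(out):
        out[:, :] = 0.0   # may be compiled to an optimized fill / memset

    out = np.empty((2**12, 2**12))
    zero_with_loop(out)    # time both after the first (compilation) call
    zero_with_slice(out)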

