
Merits of avoiding allocations for soft realtime NumPy/CPython


I read that (soft) real-time programs often avoid heap allocations in part due to unpredictable timings, especially when stop-the-world (STW) garbage collection (GC) is used to free memory. I'm wondering if avoiding heap allocations is at all helpful for reducing lag in a main loop (say, 100 Hz) that uses NumPy and CPython. My questions:

  • CPython uses reference counting for the most part and a STW GC for cyclic references. Does that mean the STW part would never trigger if I don't use any objects with cyclic references? For example, scalars and NumPy arrays don't seem to have cyclic references, and most of them would not go beyond the function in which they are allocated.
  • Would reducing array allocations (preallocate, in-place, etc) make a significant difference?
  • Typical NumPy expressions allocate a temporary array for every operation; what are some good ways to get around this? Only thing that comes to mind now is very tedious Numba rewrites, and even then I'm not sure if non-ufuncs can avoid allocating a temporary array e.g. output[:] = not_a_ufunc(input)

Solution

  • CPython uses reference counting for the most part and a STW GC for cyclic references. Does that mean the STW part would never trigger if I don't use any objects with cyclic references? For example, scalars and NumPy arrays don't seem to have cyclic references, and most of them would not go beyond the function in which they are allocated.

    As long as a NumPy array holds native scalar types (e.g. np.float64, np.int32), it is generally fine. But if the array contains pure CPython objects, the GC can become an issue (although this is rarely the case, since cycles are rare and Python uses a generational GC).

    Actually, the GC could run a collection in both cases (especially when new CPython objects, including NumPy arrays, are created or deleted). However, the overhead of the GC is negligible in a program using natively-typed NumPy arrays, since the number of references to traverse is small: the cells of such an array are not visible to the garbage collector, as opposed to the case where the array contains pure CPython objects.

    Note that reference cycles are theoretically possible with NumPy arrays containing pure CPython objects, as two arrays can contain references to each other:

    import numpy as np

    a = np.empty(1, dtype=object)
    b = np.empty(1, dtype=object)
    a[0] = b  # a references b
    b[0] = a  # b references a: a reference cycle
    

    Note that you can disable the GC in your targeted use-case, as stated in the Python documentation. It should, however, not make a significant difference in most cases.
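    For reference, a minimal sketch of pausing the collector around a latency-sensitive section (the loop body and iteration count are placeholders, not from the original post):

    ```python
    import gc

    # Pausing the cyclic collector around a latency-sensitive section.
    # Reference counting still frees acyclic garbage immediately; only
    # cycle detection is deferred.
    gc.disable()
    try:
        for _ in range(3):  # stand-in for the 100 Hz main loop
            pass            # real-time work would go here
    finally:
        gc.enable()
        gc.collect()  # optionally clean up any cycles that accumulated
    ```

    On Python 3.7+, `gc.freeze()` can additionally move all objects allocated at startup into a permanent generation so later collections never rescan them.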

  • Would reducing array allocations (preallocate, in-place, etc) make a significant difference?

    Definitely, yes! When you deal with many very small NumPy arrays, creating an array is quite expensive (more than 400 ns on my machine). This post and this one are interesting examples showing the cost of allocating NumPy arrays. However, you should check that allocation is the actual bottleneck before applying in-place optimizations massively in a big codebase, as they make the code clearly harder to read and maintain (and so reduce the ability to apply further high-level optimizations later).
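    As an illustration, a hedged sketch of the preallocation pattern (the array sizes and the update formula are made up for the example):

    ```python
    import numpy as np

    n = 1024
    x = np.zeros(n)   # state updated every iteration
    v = np.ones(n)    # some per-element rate
    dt = 0.01

    # Preallocate every buffer once, outside the loop...
    tmp = np.empty(n)

    for _ in range(100):  # stand-in for the 100 Hz main loop
        # ...then reuse them with in-place operations: no per-iteration
        # allocation, so no new objects for the allocator (or GC) to touch.
        np.multiply(v, dt, out=tmp)  # tmp = v * dt, written in place
        x += tmp                     # in-place add, no temporary
    ```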

  • Typical NumPy expressions allocate a temporary array for every operation; what are some good ways to get around this?

    You can use the out parameter of NumPy functions to avoid allocating a new array (as seen in the previous SO post link). Note that this is not always possible.
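    For example, a chain of operations can be written so that every step reuses one preallocated output buffer (the arrays and the expression are illustrative):

    ```python
    import numpy as np

    a = np.full(8, 2.0)
    b = np.full(8, 3.0)
    out = np.empty(8)

    # out = (a * b) + a, computed without any temporary array:
    np.multiply(a, b, out=out)  # out = a * b
    np.add(out, a, out=out)     # out += a

    # Equivalent to the allocating expression `a * b + a`,
    # which would create two fresh arrays.
    ```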

    Only thing that comes to mind now is very tedious Numba rewrites, and even then I'm not sure if non-ufuncs can avoid allocating a temporary array e.g. output[:] = not_a_ufunc(input)

    Using instructions like output[:] = numpy_function(...) may not help a lot, as they will likely create a new temporary array and then perform a copy. The copy is often expensive on big arrays but often cheap on small ones (due to CPU caches).
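    To make the difference concrete, here is a hedged comparison (the arrays are illustrative): the slice assignment computes the right-hand side into a fresh temporary and then copies it, while passing out= lets the ufunc write directly into the buffer:

    ```python
    import numpy as np

    x = np.linspace(0.0, 1.0, 1000)
    output = np.empty_like(x)

    # Version 1: np.sin(x) allocates a temporary array; the slice
    # assignment then copies it into `output`.
    output[:] = np.sin(x)

    # Version 2: the ufunc writes straight into `output`,
    # skipping both the temporary and the copy.
    np.sin(x, out=output)
    ```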

    AFAIK, Numba barely optimizes allocations (unless the variable is unused or the code is trivial). However, Numba helps to avoid creating many temporary arrays. Not to mention that the creation of temporary arrays is not the only problem with small NumPy arrays: NumPy calls are quite expensive too (due to many internal checks and the C/Python context switch in the interpreter), and reducing the number of NumPy calls can quickly become tedious or tricky.