I am not 100% positive that this question has a solution besides "that's the overhead, live with it", but you never know.
I have a very simple set of elementary mathematical operations done on rather small 1D NumPy arrays (6 to 10 elements). The arrays' dtype is np.float32, while the other inputs are standard Python floats.
The differences in timings are reproducible on all machines I have (Windows 10 64 bit, Python 3.9.10 64 bit, NumPy 1.21.5 MKL).
An example:
def NumPyFunc(array1, array2, float1, float2, float3):
    output1 = (array2 - array1) / (float2 - float1)
    output2 = array1 + output1 * (float3 - float1)
    return output1, output2
Given these inputs:
import numpy
sz = 6
array1 = 3000.0 * numpy.random.uniform(size=(sz,)).astype(numpy.float32)
array2 = 2222.0 * numpy.random.uniform(size=(sz,)).astype(numpy.float32)
float1 = float(numpy.random.uniform(100000, 1e7))
float2 = float(numpy.random.uniform(100000, 1e7))
float3 = float(numpy.random.uniform(100000, 1e7))
I get on machine 1:
%timeit NumPyFunc(array1, array2, float1, float2, float3)
3.33 µs ± 18 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And on machine 2:
%timeit NumPyFunc(array1, array2, float1, float2, float3)
1.5 µs ± 19.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
All nice and well, but I have to do these operations millions upon millions of times. One suggestion would be to use the Numba LLVM JIT compiler (which I know nothing about), but I heard it can get cumbersome to distribute an application with py2exe when Numba is involved.
So I thought I'd make a simple Fortran subroutine and wrap it with f2py, just for fun:
pure subroutine f90_small_arrays(n, array1, array2, float1, float2, float3, output1, output2)
    implicit none

    integer, intent(in) :: n
    real(4), intent(in), dimension(n) :: array1, array2
    real(4), intent(in) :: float1, float2, float3
    real(4), intent(out), dimension(n) :: output1, output2

    output1 = (array2 - array1) / (float2 - float1)
    output2 = array1 + output1 * (float3 - float1)
end subroutine f90_small_arrays
and time it in a Python function like this:
from f90_small_arrays import f90_small_arrays
def FortranFunc(array1, array2, float1, float2, float3):
    output1, output2 = f90_small_arrays(array1, array2, float1, float2, float3)
    return output1, output2
I get on machine 1:
%timeit FortranFunc(array1, array2, float1, float2, float3)
654 ns ± 0.869 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
And on machine 2:
%timeit FortranFunc(array1, array2, float1, float2, float3)
286 ns ± 5.92 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Which is more than 5 times faster than NumPy, even though I am just doing basic math operations.
While I get it that array creation has its own overhead, I wasn't expecting such a big ratio between the two. I have also tried to upgrade to NumPy 1.26.3, but it is actually 15% slower than NumPy 1.21.5...
I can of course get the answer "just replace the NumPy code with the Fortran one", but that implies a loss of readability: the code doing the actual operation lives in another file, a Fortran file.
It may also be that there is nothing that can be done to narrow the gap between NumPy and Fortran, and the overhead of operations in NumPy arrays is what it is.
But of course any ideas/suggestions are more than welcome :-) .
> I am not 100% positive that this question has a solution besides "that's the overhead, live with it", but you never know.
Generally, NumPy is not optimized for computing small arrays (nor for arrays whose last axis is small). For example, creating arrays, collecting them, analysing their types, etc. is pretty expensive.
> All nice and well, but I have to do these operations millions upon millions of times. One suggestion would be to use Numba (which I know nothing about), but I heard it can get cumbersome to distribute an application with py2exe when Numba is involved.
This is indeed a usual way to fix that. Note that Cython can be better for packaging. However, Numba can easily reduce the cost of creating NumPy arrays (without removing it completely), while this is more difficult in Cython. The main reason is that Numba uses its own implementation of NumPy and can optimize the code thanks to its JIT compiler, while CPython and Cython do not. The usual way to speed up NumPy operations in Cython is to create memory views on the arrays. Creating new small arrays and calling NumPy functions remains expensive. For reference, a minimal Numba sketch of the function from the question could look like the one below (I have not benchmarked it on your machines, and the first call pays a one-time JIT compilation cost):
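import numba as nb
import numpy as np

@nb.njit  # JIT-compiled to native code on the first call
def numba_func(array1, array2, float1, float2, float3):
    # cast the Python floats once so the arithmetic stays in float32
    c1 = np.float32(float2 - float1)
    c2 = np.float32(float3 - float1)
    output1 = (array2 - array1) / c1
    output2 = array1 + output1 * c2
    return output1, output2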
Alternatives include directly calling native functions (C, C++, Fortran, Rust, etc.), possibly with the help of a binding library/tool. Such native functions can be called through Cython, ctypes, CFFI, etc. For instance, here is a minimal ctypes sketch; the shared-library name small_arrays.so and the C-compatible symbol c_small_arrays are hypothetical (e.g. a bind(c) wrapper around your Fortran subroutine), and building the library is left aside:
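import ctypes
import numpy as np

# Hypothetical shared library exposing a C-compatible entry point
# (e.g. the Fortran subroutine wrapped with bind(c)).
lib = ctypes.CDLL("./small_arrays.so")

f32_array = np.ctypeslib.ndpointer(np.float32, flags="C_CONTIGUOUS")
lib.c_small_arrays.argtypes = [ctypes.c_int,          # n
                               f32_array, f32_array,  # array1, array2
                               ctypes.c_float, ctypes.c_float, ctypes.c_float,
                               f32_array, f32_array]  # output1, output2
lib.c_small_arrays.restype = None

def native_func(array1, array2, float1, float2, float3):
    out1 = np.empty_like(array1)
    out2 = np.empty_like(array1)
    lib.c_small_arrays(array1.size, array1, array2,
                       float1, float2, float3, out1, out2)
    return out1, out2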
> So I thought I'd make a simple Fortran subroutine
This is a pretty good solution regarding the target granularity (though calling it in a portable way can sometimes be a bit tedious). Note that the function does not create intermediate arrays and switches from CPython to native code only once (and the same applies to type checking). This matters regarding overheads.
> While I get it that array creation has its own overhead, I wasn't expecting such a big ratio between the two.
NumPy needs to perform significantly more operations than your Fortran function. NumPy is designed to be generic, while your Fortran function addresses one specific problem. NumPyFunc performs at least 6 calls into NumPy, and NumPy cannot optimize across them because of the CPython interpreter. Interpreters are painfully slow (especially CPython, due to the very dynamic nature of the language and its numerous high-level features). Each of those 6 operations involves, among other things:

- creating a temporary array (some may be recycled nowadays to speed up operations on large arrays, but this optimization introduces even more overhead); each array (actually a view on a reference-counted raw memory buffer) is a CPython object which needs to be allocated, reference counted and freed;
- dynamically checking the dtype, shape, dimension, strides and target axis of the input array objects;
- checking/performing high-level features like wrap-around and broadcasting (for each dimension); because of the many possible input parameters, NumPy does this by creating generic internal multi-dimensional iterators (one per input array); this part of the NumPy code is sub-optimal (it could be optimized, though this is pretty complicated IMHO, so not much has been done yet); for example, each iterator is AFAIK dynamically allocated/freed;
- checking/reporting issues like division by zero and supporting special values like NaN and Inf, since NumPy is designed to be rather user-friendly (while trying to mitigate the associated overhead as much as possible);
- last but not least, releasing and re-acquiring the CPython GIL (global interpreter lock).

This list is not exhaustive: there are more (technical) operations I did not mention here (e.g. picking the best kernel so the operation is done in a SIMD-friendly way, the CPython lookup of NumPy module variables, etc.).
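To make the temporary-array point concrete, here is a sketch of the same computation written with caller-provided buffers and NumPy's out= arguments. This only removes the temporaries; every call still pays the full dispatch overhead listed above, so do not expect it to close the gap:

import numpy as np

def numpy_func_noalloc(array1, array2, float1, float2, float3, out1, out2):
    # reuse preallocated float32 buffers instead of creating temporaries
    c1 = np.float32(float2 - float1)
    c2 = np.float32(float3 - float1)
    np.subtract(array2, array1, out=out1)  # out1 = array2 - array1
    out1 /= c1                             # out1 /= (float2 - float1)
    np.multiply(out1, c2, out=out2)        # out2 = out1 * (float3 - float1)
    out2 += array1                         # out2 += array1
    return out1, out2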
Almost none of that is done by your Fortran code at runtime, and when it is, it is only done once instead of 6 times. Most of the operations are either not done at all or simply done at compile time. This is mainly possible because the code is specialized. This reasonably justifies the "5 times faster" result at this granularity (especially since a single cache miss requiring a fetch from DRAM typically takes 50-100 ns).
> It may also be that there is nothing that can be done to narrow the gap between NumPy and Fortran, and the overhead of operations in NumPy arrays is what it is.
While there are certainly ways to micro-optimize the NumPy code, this will not result in a huge performance improvement. The key is simply not to use NumPy for arrays with only 6-10 items. In fact, calling a native function is already very expensive at this scale. Indeed, modern x86-64 CPUs can add 6-10 items of two arrays in only a few CPU cycles thanks to SIMD instructions! The slowest purely computational part is certainly checking the size of the array and supporting non-power-of-two sizes without doing any out-of-bounds accesses. The computing part of your Fortran function should take a very tiny fraction of the reported time! Without considering possible cache misses or DRAM accesses, it should certainly take no more than 5-30 ns! All the rest (90-98%) is actually pure overhead! At this scale, even a (non-inlined) native function call is expensive, so there is no chance for NumPy/CPython to be that fast.
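A practical consequence: if you have to do this millions of times, batch the work. Stacking the small arrays into one big 2D float32 array lets a single NumPy call process all of them, so the per-call overhead is paid once instead of once per small array. A sketch, assuming the scalars are shared across all the small arrays:

import numpy as np

n_arrays, sz = 1_000_000, 6
big1 = np.random.uniform(size=(n_arrays, sz)).astype(np.float32)
big2 = np.random.uniform(size=(n_arrays, sz)).astype(np.float32)
float1, float2, float3 = 1.0, 2.0, 3.0  # placeholder scalars

# Two vectorized calls process every row at once: the interpreter and
# dispatch overheads are amortized over the million rows.
output1 = (big2 - big1) / np.float32(float2 - float1)
output2 = big1 + output1 * np.float32(float3 - float1)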
To emphasise how fast your Fortran code can be, here is its assembly code (compiled with gfortran and -O2 optimizations):
f90_small_arrays_:
        mov     r11, rdi
        mov     rax, rcx
        movss   xmm1, DWORD PTR [r8]
        mov     rdi, rdx
        movss   xmm2, DWORD PTR [rax]
        movsx   rdx, DWORD PTR [r11]
        mov     rcx, QWORD PTR [rsp+8]
        mov     r10, QWORD PTR [rsp+16]
        subss   xmm1, xmm2
        test    edx, edx
        jle     .L1
        sal     rdx, 2
        xor     eax, eax
.L3:
        movss   xmm0, DWORD PTR [rdi+rax]
        subss   xmm0, DWORD PTR [rsi+rax]
        divss   xmm0, xmm1
        movss   DWORD PTR [rcx+rax], xmm0
        add     rax, 4
        cmp     rdx, rax
        jne     .L3
        movss   xmm1, DWORD PTR [r9]
        xor     eax, eax
        subss   xmm1, xmm2
.L4:
        movss   xmm0, DWORD PTR [rcx+rax]
        mulss   xmm0, xmm1
        addss   xmm0, DWORD PTR [rsi+rax]
        movss   DWORD PTR [r10+rax], xmm0
        add     rax, 4
        cmp     rdx, rax
        jne     .L4
.L1:
        ret
On my i5-9600KF CPU, the function prologue/epilogue takes 5-20 cycles. The first loop takes 3 cycles/item and the second takes 1.5 cycles/item. Note that the number of cycles is so small because instructions are pipelined and executed in parallel. In the end, the whole function should take 50-100 cycles on my machine for 10 items, that is, roughly 10-20 ns! In practice, this code does not benefit from SIMD instructions. It can be even faster with -O3 (producing 128-bit SIMD code), since 4 items can be computed at once for nearly the same cost as 1. In practice, the arrays are so small that this should not result in more than a twice-faster (~10 ns) execution, and most of the time should not even be spent in the main SIMD loop (<5 ns on my CPU). Moreover, the two loops can be merged. In this case, I expect the main computing overhead to actually come from the misprediction of the number of loop iterations (because the CPU does not know the size of the array unless this code is run very frequently with the same array size), not to mention calling this function from CPython, which takes hundreds of nanoseconds.
Thus, if you really care about performance, you should avoid calling the Fortran function many times and put as much work as possible into your native function(s) (work currently done in CPython). Moreover, providing the size of the arrays at compile time can also strongly improve performance in this case. Using -ffast-math can improve things even further (by using a fast reciprocal instruction instead of a slow division), though you should be careful about its implications. The latter results in generated code of 36 instructions taking only 5-10 ns on my CPU (compared to ~300 ns for calling the Fortran function from CPython and ~3000 ns for the NumPy code). It is so fast that the performance of the native code is bound by the latency of the CPU instructions. Thus, my CPU can pipeline the computation of many arrays (of 10 items) simultaneously while taking only 2 ns per array! Newer CPUs can even reach 1 ns/array! This is much faster than executing any CPython statement (including a basic variable assignment) and much faster than even a native function call. Not to mention that reference counting a single CPython object/NumPy array or GIL operations are far slower than this too (the atomic operations involved usually take dozens of ns). This is why nearly all CPython modules cannot really help you here (compiler-based modules are the best solution if you want to keep your main code in CPython, despite the significant overhead required to convert CPython types to native ones and call native functions from the interpreter).
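If the scalars differ per small array, the same batching idea still applies: move the whole loop into native code. A hedged Numba sketch, assuming the scalars are passed as float32 arrays floats1/floats2/floats3 (fastmath=True mirrors -ffast-math and carries the same caveats):

import numba as nb
import numpy as np

@nb.njit(fastmath=True)  # fastmath mirrors -ffast-math; same caveats apply
def batch_func(big1, big2, floats1, floats2, floats3, out1, out2):
    # the whole loop runs as native code: CPython is entered only once
    for i in range(big1.shape[0]):
        inv = np.float32(1.0) / (floats2[i] - floats1[i])
        c2 = floats3[i] - floats1[i]
        for j in range(big1.shape[1]):  # tiny, easily pipelined inner loop
            o1 = (big2[i, j] - big1[i, j]) * inv
            out1[i, j] = o1
            out2[i, j] = big1[i, j] + o1 * c2

On top of amortizing the call overhead over millions of rows, this keeps the inner loop's trip count predictable, which addresses the branch-misprediction issue discussed above.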