
Why is the LLVM IR generated by Numba for vector addition so complex?


I wanted to check the LLVM IR that Numba generates for a vector addition and noticed it produces a lot of IR for a simple add. I was hoping for a simple "add" instruction, but it generates about 2000 lines of LLVM IR. Is there a way to get minimal code?

from numba import jit
import numpy as np

@jit(nopython=True, nogil=True)
def mysum(a, b):
    return a + b

a, b = 1.3 * np.ones(5), 2.2 * np.ones(5)
mysum(a, b)

# Get the llvm IR
llvm_ir = list(mysum.inspect_llvm().values())[0]
print(llvm_ir)
with open("llvm_ir.ll", "w") as file:
    file.write(llvm_ir)

# Get the assembly code
asm = list(mysum.inspect_asm().values())[0]
print(asm)

with open("llvm_ir.asm", "w") as file:
    file.write(asm)

Solution

  • Numba generates three functions. The first one does the actual computation. The second one is a wrapper meant to be called from CPython: it converts dynamic CPython objects to native types for the input values and performs the opposite conversion for the returned values. The last one is meant to be called from other Numba functions (if any).

    Converting Numpy arrays is not a trivial task: Numpy arrays are dynamic objects containing a bunch of information (a memory buffer, the number of dimensions, the stride and size along each dimension, the dynamic Numpy type, etc.). This is why the generated code is significantly bigger with Numpy arrays than with simpler data types like floating-point values. Indeed, the whole LLVM IR is about 20 times smaller in that case, and the wrapping function is very simple.

    Still, the main issue is not so much the wrapping function but the first one doing the actual computation (75% of the LLVM IR). One reason is that a + b creates a new temporary Numpy array that must be allocated, initialized and filled using an implicit loop. This implicit operation generates more code than doing the same thing manually, certainly because Numba needs to handle many possible cases that may never happen in practice. For example, the LLVM IR of the following Numba function is half the size:

    @jit('float64[::1](float64[::1], float64[::1])', nopython=True, nogil=True)
    def mysum(a, b):
        out = np.empty(a.size, dtype=np.float64)
        for i in range(a.size):
            out[i] = a[i] + b[i]
        return out
    

    If we also remove the loop, the IR is halved again. This shows that the Numpy array creation/initialization takes a significant fraction of the code space. The loop also takes significant space because Numba needs to support the wrap-around (negative) indexing feature of Numpy arrays, and because Numpy arrays do not have a typed data buffer. In C, arrays and pointers are much simpler and there is no wrap-around.

    Generating pretty huge IR/ASM code is common in high-level languages. The code is often big due to advanced features and poor code-size optimizations. Reducing the size of the generated code is significant work and sometimes conflicts with performance. Indeed, to produce high-performance code, compilers often need to unroll loops and split the code into different variants to mitigate the cost of higher-level features (e.g. pointer aliasing, vectorization, removal of wrap-around), resulting in significantly bigger IR/ASM code.