I have a fairly large project converted to Numba, and with @nb.njit(cache=True, parallel=True, nogil=True), run #1 is slow (around 15 seconds versus 0.2-1 seconds once compiled). I realize Numba is compiling machine code optimized for the specific PC I'm running it on, but since the code is distributed to a large audience, I don't want the first run after we deploy our model to take forever. What the documentation doesn't cover is a "generic x64" cache=True method. I don't care if the code is a little slower on a PC that doesn't have my specific processor; I only care that the initial and subsequent runtimes are quick, and preferably that they don't differ by a huge margin if I distribute a cache file for the @njit functions at deployment.
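For reference, a minimal sketch that reproduces the gap I'm seeing (the kernel body here is a stand-in, not my actual 50-line function):

    import time
    import numba as nb
    import numpy as np

    @nb.njit(cache=True, parallel=True, nogil=True)
    def kernel(x):
        # Parallel scalar reduction; Numba supports reductions in prange loops.
        acc = 0.0
        for i in nb.prange(x.size):
            acc += x[i] * x[i]
        return acc

    x = np.random.rand(10_000_000)
    t0 = time.perf_counter(); kernel(x); print("run #1 (compiles):", time.perf_counter() - t0)
    t0 = time.perf_counter(); kernel(x); print("run #2 (cached):", time.perf_counter() - t0)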
Does anyone know if such a "generic" x64 implementation is possible using Numba? Or are we stuck with a slow run #1 and fast ones thereafter?
Please comment if you want more details; basically it's a roughly 50-line function that gets JIT-compiled via Numba and afterwards runs quite fast in parallel with no GIL. I'm willing to give up some extreme optimization if the code can work in a generic form across multiple processors, since where I work the PCs vary quite a bit in how advanced they are.
I looked briefly at Numba's AOT (ahead-of-time) compilation, but these functions have so many variables being altered that I think it would take me weeks to properly decorate them with signatures so they compile without a Numba dependency. I really don't have the time for AOT; it would make more sense to rewrite the whole algorithm in Cython, but that's closer to C/C++ and more time-consuming than I want to devote to this project. Unfortunately, there is no Numba -> Cython compiler project out there that I know of. Maybe there will be one in the future (which would be outstanding).
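For context, the documented AOT path (numba.pycc) looks roughly like this; the module and function names below are just placeholders, and the point is that every exported function needs an explicit signature string:

    from numba.pycc import CC

    cc = CC('my_module')  # name of the compiled extension module

    # Each export needs an explicit signature; this is what I'd have to
    # write out for every function and every type combination.
    @cc.export('mult', 'f8(f8, f8)')
    def mult(a, b):
        return a * b

    if __name__ == "__main__":
        cc.compile()  # produces an extension module with no Numba dependency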
Unfortunately, you mainly listed all the currently available options. Numba functions can be cached, and the signature can be specified so as to perform an eager compilation (compilation at the time of the function definition) instead of a lazy one (compilation during the first execution). Note that the cache=True flag is only meant to skip the compilation when it has already been done on the same platform before, not to share the compiled code between multiple machines. AFAIK, the internal JIT used by Numba (llvmlite) does not support that. In fact, doing that is exactly the purpose of AOT compilation. That being said, AOT compilation requires the signatures to be provided (this is mandatory whatever the approach/tool used, as long as the function is compiled ahead of time) and it has quite strong limitations (e.g. currently there is no support for parallel code or fastmath). Keep in mind that Numba's main use case is just-in-time compilation, not ahead-of-time compilation.
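For example, eager compilation just means giving the decorator an explicit signature; the types below are an assumption about your function, not taken from your code:

    import numba as nb
    import numpy as np

    # The signature string triggers compilation at definition time (eager)
    # rather than on the first call (lazy); cache=True then reuses the
    # result on subsequent runs on the same machine.
    @nb.njit("float64[:](float64[:], float64[:])",
             cache=True, parallel=True, nogil=True)
    def add_scaled(a, b):
        out = np.empty_like(a)
        for i in nb.prange(a.size):
            out[i] = a[i] + 2.0 * b[i]
        return out

Note that this moves the compilation cost to import time rather than removing it, so it only helps if the module is imported well before the hot path runs.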
Regarding your use case, using Cython appears to make much more sense: the functions are pre-compiled once for some generic platform, and the compiled binaries can be shipped directly to users without any recompilation on the target machine.
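As a sketch of what that build step could look like, targeting a baseline x86-64 CPU so one binary runs everywhere (kernel.pyx is a placeholder, and the -march flag is GCC/Clang-specific):

    # setup.py -- pre-build a Cython extension for a generic x86-64 target
    from setuptools import setup
    from setuptools.extension import Extension
    from Cython.Build import cythonize

    ext = Extension(
        "kernel",
        sources=["kernel.pyx"],
        # Baseline x86-64 (SSE2 only): slower than -march=native, but the
        # resulting binary runs on any x86-64 machine.
        extra_compile_args=["-O3", "-march=x86-64"],
    )

    setup(ext_modules=cythonize([ext]))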
"I don't care if the code is a little slower on a PC that doesn't have my specific processor."
Well, regarding your code, "generic" x86-64 code can be much slower. The main reason lies in the use of SIMD instructions. All x86-64 processors support the SSE2 instruction set, which provides basic 128-bit SIMD registers operating on integers and floating-point numbers. For about a decade, x86-64 processors have supported the 256-bit AVX instruction set, which significantly speeds up floating-point computations. For at least 7 years, almost all mainstream x86-64 processors have supported the AVX2 instruction set, which mainly speeds up integer computations (although it also improves floating-point code thanks to new features). For nearly a decade, the FMA instruction set has been able to speed up code using fused multiply-adds by a factor of 2. Recent Intel processors support the 512-bit AVX-512 instruction set, which not only doubles the number of items that can be processed per instruction but also adds many useful features. In the end, SIMD-friendly code can be up to an order of magnitude faster with the new instruction sets than with the baseline "generic" SSE2 instruction set. Compilers (e.g. GCC, Clang, ICC) are meant to generate portable code by default and thus only use SSE2. Note that Numpy already uses such "new" features to speed up many functions (see sorts, argmin/argmax, log/exp, etc.).
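To see which of these feature sets the JIT actually targets on a given machine, you can query llvmlite (Numba's LLVM binding) directly; a minimal sketch:

    import llvmlite.binding as llvm

    # Standard llvmlite initialization incantation.
    llvm.initialize()
    llvm.initialize_native_target()
    llvm.initialize_native_asmprinter()

    # Numba compiles for the host CPU by default, so the cached machine
    # code depends on what is reported here.
    print(llvm.get_host_cpu_name())           # e.g. "skylake"
    features = llvm.get_host_cpu_features()   # mapping: feature name -> bool
    for isa in ("sse2", "avx", "avx2", "fma", "avx512f"):
        print(isa, features.get(isa, False))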