Are Python C extensions faster than Numba JIT?

I am testing the performance of the Numba JIT vs Python C extensions. It seems the C extension is about 3-4 times faster than the Numba equivalent for a for-loop-based function to calculate the sum of all the elements in a 2d array.

Update:

Based on valuable comments, I realized my mistake: I should have called the Numba function once before timing it, so that JIT compilation is excluded from the measurements. Below are the results after that fix, along with some extra cases. The question remains, though: when should each method be preferred, and how should I decide?

Here's the result (time_s, value):

# 200 tests mean (including JIT compile inside the loop)
Pure Python: (0.09232537984848023, 29693825)
Numba: (0.003188209533691406, 29693825)
C Extension: (0.000905141830444336, 29693825.0)

# JIT once called before the test loop (to avoid compile time)
Normal: (0.0948486328125, 29685065)
Numba: (0.00031280517578125, 29685065)
C Extension: (0.0025129318237304688, 29685065.0)

# JIT no warm-up also no test loop (only calling once)
Normal: (0.10458517074584961, 29715115)
Numba: (0.314251184463501, 29715115)
C Extension: (0.0025091171264648438, 29715115.0)
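
The difference between the three result sets comes down to whether the first (compiling) call is inside the timed loop. A minimal sketch of a harness that reports the first call separately from the steady-state mean is shown here; `bench` and `python_sum` are hypothetical names that are not part of the code further down, and `time.perf_counter` is used because it is better suited to short intervals than `time.time`.

```python
import time
from statistics import mean


def bench(fn, *args, repeats=100):
    """Return (first_call_s, steady_mean_s, value) for fn(*args)."""
    # Time the very first call on its own: for a JIT-compiled function
    # this is where the compilation cost is paid.
    start = time.perf_counter()
    value = fn(*args)
    first_call = time.perf_counter() - start

    # Steady-state timings, measured after the warm-up call above.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        value = fn(*args)
        timings.append(time.perf_counter() - start)
    return first_call, mean(timings), value


def python_sum(rows):
    # Stand-in workload (hypothetical); swap in the function under test.
    total = 0
    for row in rows:
        for x in row:
            total += x
    return total


data = [[i + j for j in range(20)] for i in range(1000)]
first, steady, value = bench(python_sum, data, repeats=20)
print(f"first call: {first:.6f}s, steady state: {steady:.6f}s, value: {value}")
```

For a plain Python function the two timings should be similar; for a lazily compiled Numba function the first call is expected to be much slower than the steady state.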

Is my implementation correct?
Is there a reason why C extensions are faster?
Should I always use C extensions if I want the best performance (for non-vectorized functions)?

main.py

import numpy as np
import pandas as pd
import numba
import time
import loop_test # ext


def test(fn, *args):
    res = []
    val = None
    for _ in range(100):
        start = time.time()
        val = fn(*args)
        res.append(time.time() - start)
    return np.mean(res), val


sh = (30_000, 20)
col_names = [f"col_{i}" for i in range(sh[1])]
df = pd.DataFrame(np.random.randint(0, 100, size=sh), columns=col_names)
arr = df.to_numpy()


def sum_columns(arr):
    _sum = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            _sum += arr[i, j]
    return _sum


@numba.njit
def sum_columns_numba(arr):
    _sum = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            _sum += arr[i, j]
    return _sum


print("Pure Python:", test(sum_columns, arr))
print("Numba:", test(sum_columns_numba, arr))
print("C Extension:", test(loop_test.loop_fn, arr))

ext.c

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>

static PyObject *loop_fn(PyObject *module, PyObject *args)
{
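    PyObject *arr;
    if (!PyArg_ParseTuple(args, "O!", &PyArray_Type, &arr))
        return NULL;

    /* Convert to a C-contiguous double array (copies if the dtype differs). */
    PyArrayObject *arr_new = (PyArrayObject *)PyArray_FROM_OTF(arr, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    if (arr_new == NULL)
        return NULL;
    if (PyArray_NDIM(arr_new) != 2)
    {
        Py_DECREF(arr_new);
        PyErr_SetString(PyExc_ValueError, "expected a 2-D array");
        return NULL;
    }

    /* Read the shape from the converted array, not from the raw PyObject. */
    npy_intp *dims = PyArray_DIMS(arr_new);
    npy_intp rows = dims[0];
    npy_intp cols = dims[1];
    double sum = 0;
    double *data = (double *)PyArray_DATA(arr_new);
    npy_intp i, j;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            sum += data[i * cols + j];
    Py_DECREF(arr_new);
    return Py_BuildValue("d", sum);
}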

static PyMethodDef Methods[] = {
    {
        .ml_name = "loop_fn",
        .ml_meth = loop_fn,
        .ml_flags = METH_VARARGS,
        .ml_doc = "Returns the sum using for loop, but in C.",
    },
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef Module = {
    PyModuleDef_HEAD_INIT,
    "loop_test",
    "A benchmark module test",
    -1,
    Methods};

PyMODINIT_FUNC PyInit_loop_test(void)
{
    import_array();
    return PyModule_Create(&Module);
}

setup.py

from setuptools import setup, Extension  # distutils was removed in Python 3.12
import numpy as np

module = Extension(
    "loop_test",
    sources=["ext.c"],
    include_dirs=[
        np.get_include(),
    ],
)

setup(
    name="loop_test",
    version="1.0",
    description="This is a test package",
    ext_modules=[module],
)

python3 setup.py install

Solution

I would like to complement the good answer of John Bollinger:

First of all, C extensions tend to be compiled with GCC on Linux (possibly MSVC on Windows and Clang on macOS, AFAIK), while Numba uses the LLVM compilation toolchain internally. If you want to compare both, then you should use Clang, which is based on the LLVM toolchain. In fact, you should also use the same version of LLVM as Numba for the comparison to be fair. Clang, GCC and MSVC do not optimize code the same way, so the resulting programs can perform quite differently.

Moreover, Numba is a JIT compiler, so it does not need to care about compatibility (of instruction set extensions) between different platforms. This means it can use the AVX2 SIMD instruction set if it is available on your machine, while mainstream compilers will not do that by default for the sake of compatibility. In fact, Numba actually does that. You can tell Clang and GCC to optimize the code for the target machine, without caring about compatibility between machines, with the compilation flag -march=native. The resulting package will certainly be faster, but it can also crash on older machines (or possibly be significantly slower). You can also enable specific instruction sets (with flags like -mavx2).

Additionally, Numba uses an aggressive optimization level by default while, AFAIK, C extensions are built with the -O2 flag, which historically does not auto-vectorize the code on GCC (i.e. no use of packed SIMD instructions); note that GCC 12 and later, as well as Clang, do enable some vectorization at -O2. You should certainly specify the -O3 flag manually if this is not already done. On MSVC, the equivalent flag is /O2 (AFAIK there is no /O3 yet).
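
As a rough sketch of how these flags could be passed when building the extension, the snippet below picks a flag list per platform and hands it to `extra_compile_args`. The helper names `optimization_flags` and `make_extension` are hypothetical, and the platform check assumes MSVC on Windows and GCC or Clang elsewhere, which may not match your toolchain.

```python
import sys


def optimization_flags(platform=sys.platform):
    """Pick aggressive, machine-specific compiler flags (hypothetical helper)."""
    if platform.startswith("win"):
        # Assumption: MSVC is the compiler on Windows; /O2 is its
        # highest general optimization level.
        return ["/O2"]
    # GCC and Clang: -O3 plus tuning for the machine doing the build.
    # Binaries built with -march=native may not run on older CPUs.
    return ["-O3", "-march=native"]


def make_extension():
    # Imported lazily so the helper above works without these packages.
    from setuptools import Extension
    import numpy as np

    return Extension(
        "loop_test",
        sources=["ext.c"],
        include_dirs=[np.get_include()],
        extra_compile_args=optimization_flags(),
    )


print(optimization_flags())
```

In a setup.py, the returned object would be passed as `ext_modules=[make_extension()]`. Keep in mind that the flags only make the comparison fairer; they do not make the two implementations equivalent.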

Please note that Numba functions can be compiled eagerly (as opposed to lazily, which is the default) by providing a specific signature (or several). This means you need to know the types of the input parameters, and the start-up time of your application can increase significantly. Numba functions can also be cached so that the function is not recompiled over and over on the same platform. This can be done with the flag cache=True. It may not always work, depending on your specific use case, though.

Last but not least, the two codes are not equivalent. This is certainly the most important point. The Numba code deals with an int32-typed arr and accumulates the values in a 64-bit integer _sum, while the C extension accumulates them in a double-precision floating-point variable. Floating-point addition is not associative (unless you tell the compiler to assume it is, with the flag -ffast-math, which is not enabled by default since it is unsafe), so the compiler cannot reorder the additions, and accumulating floating-point numbers is far more expensive than accumulating integers due to the high latency of the FMA unit on most platforms. Besides, I actually wonder whether PyArray_FROM_OTF performs the correct conversion, but if it does, then I expect the conversion to be pretty expensive. You should use the same types in both codes for the comparison to be fair (possibly 64-bit integers in both).
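
To illustrate why a compiler may not reorder a floating-point sum on its own, here is a small pure-Python sketch. `two_lane_sum` is a hypothetical stand-in for what a vectorized loop does (several independent partial sums combined at the end); it is not how Numba or GCC actually generate code.

```python
def sequential_sum(values, zero):
    # One accumulator, strict left-to-right order (what the C loop specifies).
    total = zero
    for v in values:
        total += v
    return total


def two_lane_sum(values, zero):
    # Two independent accumulators, as a 2-wide SIMD reduction would use.
    lanes = [zero, zero]
    for i, v in enumerate(values):
        lanes[i % 2] += v
    return lanes[0] + lanes[1]


floats = [1e16, 1.0, -1e16, 1.0]
ints = [10**16, 1, -10**16, 1]

# Floating point: the regrouped sum gives a different answer,
# because 1e16 + 1.0 rounds back to 1e16.
print(sequential_sum(floats, 0.0), two_lane_sum(floats, 0.0))  # 1.0 2.0

# Integers: regrouping never changes the result.
print(sequential_sum(ints, 0), two_lane_sum(ints, 0))  # 2 2
```

Since the reordered floating-point result differs, the compiler has to keep the additions in source order unless it is told otherwise, which leaves each addition waiting on the previous one.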

For more information, please read the related posts: