Tags: numpy, performance, modulo

Why is `array % 1` running 150% slower for smaller numbers than larger ones?


I would understand if numpy were smart enough to see that `array % 1` gives a constant result (for that dtype), and made its runtime independent of the input. I would also understand if the runtime were longer for larger numbers, maybe because dividing a large number takes longer (compare Why is floating-point division in Python faster with smaller numbers?). But why does `array % 1` take longer for smaller numbers?

from timeit import timeit

import numpy as np

a1 = np.random.randint(500,        size=20_000_000, dtype=np.int32) - 250
a2 = np.random.randint(20_000_000, size=20_000_000, dtype=np.int32) - 250

print(timeit(lambda: a1 % 1, number=5))
print(timeit(lambda: a2 % 1, number=5))
print(timeit(lambda: a1 % 1, number=5))
print(timeit(lambda: a2 % 1, number=5))

Output:

0.4755298000000039
0.19294559999980265
0.460197700000208
0.19560679999995045

Numpy information:

[{'numpy_version': '2.0.0',
  'python': '3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 '
            '64 bit (AMD64)]',
  'uname': uname_result(system='Windows', node='hostname', release='11', version='10.0.22631', machine='AMD64')},
 {'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2',
                                'AVX512F',
                                'AVX512CD',
                                'AVX512_SKX',
                                'AVX512_CLX',
                                'AVX512_CNL',
                                'AVX512_ICL'],
                      'not_found': []}}]

Solution

  • TL;DR: the effect is caused by a combination of technical details. It is not only due to a branch prediction issue (as stated in the comment), but also due to the Numpy implementation, which is not optimized for this use-case, combined with poor compiler optimizations (especially with GCC/MSVC). Compiling Numpy with Clang fixes the issue.


    Under the hood

    To perform this operation, Numpy calls the function generic_wrapped_legacy_loop with another function called INT_remainder passed as an argument. The latter computes the modulus for a single array item. It is generated from a relatively convoluted generic code (hard to locate) which can be found here. In your case, the function can basically be simplified to the following code:

    static void INT_remainder(char **args, npy_intp const *dimensions, npy_intp const *steps, void *NPY_UNUSED(func))
    {
        BINARY_LOOP {
            const int in1 = *(int*)ip1; // An item of the input array
            const int in2 = *(int*)ip2; // The divisor which is set to 1
    
            // Overflow check
            if (NPY_UNLIKELY((in2 == 0) || (in1 == NPY_MIN_INT && in2 == -1))) {
                FLAG_IF_DIVIDEBYZERO(in2);
                *(int *)op1 = 0;
            }
            else
            {
                // handle mixed case the way Python does
                const int rem = in1 % in2;
                if ((in1 > 0) == (in2 > 0) || rem == 0) {
                    *(int*)op1 = rem;
                }
                else {
                    *(int*)op1 = rem + in2;
                }
            }
        }
    }
    

    The first condition is an overflow check which should never happen in your case, and it should be cheap since it is marked as unlikely. In the second part, rem should always be 0 in your case since in2 is 1, so the if condition is always true. However, the || operator is evaluated lazily (from left to right) as defined by the C/C++ standard: (in1 > 0) == (in2 > 0) is evaluated first, and rem == 0 is evaluated only if that equality is false. The thing is, with in2 fixed to 1, the equality boils down to in1 > 0, which is sometimes true and sometimes false. Its outcome depends on the input array and is therefore unpredictable. This is a problem for performance due to branch prediction. For more information, please read What is Branch Prediction?.

    The second case (a2) is faster because the condition is much more often true, so your CPU can predict that it will likely be true and speculatively take the associated conditional branch. Meanwhile, in the first case, the CPU cannot predict the outcome of the condition, so you pay the overhead of an unpredictable conditional branch (e.g. pipeline stalls).

    This effect can be confirmed by profiling the assembly code of Numpy on my machine.
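
    Another way to check this at the Python level follows directly from the explanation above (this is my own quick check, not part of the assembly profiling): if a1 is sorted first, the sign of each item becomes highly predictable (a long run of negatives followed by a long run of positives), so the slowdown should mostly disappear:

    import numpy as np
    from timeit import timeit

    a1 = np.random.randint(500, size=20_000_000, dtype=np.int32) - 250  # hard-to-predict signs
    a1_sorted = np.sort(a1)  # same values, but now the branch outcome changes only once

    print(timeit(lambda: a1 % 1, number=5))         # slow: frequent branch mispredictions
    print(timeit(lambda: a1_sorted % 1, number=5))  # expected to be close to the a2 timings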


    Notes and discussion

    The fact that branch prediction is an issue here is unfortunate, because the outcome of the condition is eventually always true! If the condition were rem == 0 || (in1 > 0) == (in2 > 0), then there would be no performance issue in this case. However, the issue would then show up for positive random integers modulo 2 (which is a more frequent use-case)...

    While (in1 > 0) == (in2 > 0) is supposed to be a branch-less implementation, the target compiler chooses to generate conditional branches in this case, resulting in the observed effect. It would not be an issue with branch-less assembly code. A branch-less implementation is often a bit slower than a well-predicted conditional branch and (significantly) faster than a mispredicted one. Thus, it seems like a good idea here.
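
    To illustrate what such branch-less logic looks like, here is a small NumPy-level sketch (my own illustration, not Numpy's actual code) which rebuilds the Python-style floored modulo from the C-style truncated remainder using arithmetic on a mask instead of a data-dependent branch; the division-by-zero and overflow corner cases handled above are ignored:

    import numpy as np

    def floored_mod_branchless(a, d):
        # np.fmod gives the C-style (truncated) remainder, with the sign of the dividend.
        rem = np.fmod(a, d)
        # Add the divisor back whenever the signs differ and the remainder is non-zero,
        # i.e. the same condition as in INT_remainder, but evaluated without a branch.
        correction = ((a > 0) != (d > 0)) & (rem != 0)
        return rem + d * correction

    a = np.random.randint(500, size=10, dtype=np.int32) - 250
    assert np.array_equal(floored_mod_branchless(a, np.int32(3)), a % 3)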


    Experimental solution

    The target compiler on my machine, running on Linux, is GCC (10.2). GCC unfortunately tends to generate such code. pip packages (like the Numpy one) are generally built with GCC on Linux and AFAIK MSVC on Windows (which generates code that is often worse than GCC's -- especially here). However, one can compile Numpy with another compiler. It turns out that Clang seems to generate better code in this case (see on GodBolt). You can now easily build Numpy 2.0 with Clang and test it using the following commands:

    # Packages Debian testing: sudo apt install virtualenv libpython3.13-dev build-essential pkg-config clang
    cd /tmp
    virtualenv venv
    ./venv/bin/pip install ipython meson ninja cython spin
    export PATH=$PWD/venv/bin:$PATH
    git clone https://github.com/numpy/numpy.git
    cd numpy
    git checkout v2.0.0
    git submodule update --init
    export CC=/usr/bin/clang
    export CXX=/usr/bin/clang++
    spin ipython
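
    Once ipython is running inside this environment, one way to double-check which compiler the freshly built Numpy was compiled with, and to re-run the benchmark, is for instance (the exact layout of the np.show_config() output depends on the Numpy build):

    import numpy as np
    from timeit import timeit

    np.show_config()  # the "Compilers" section is expected to report clang here

    a1 = np.random.randint(500,        size=20_000_000, dtype=np.int32) - 250
    a2 = np.random.randint(20_000_000, size=20_000_000, dtype=np.int32) - 250
    print(timeit(lambda: a1 % 1, number=5))
    print(timeit(lambda: a2 % 1, number=5))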
    

    Here are the results on my machine (i5-9600KF CPU):

    GCC 13.2.0 (i.e. with conditional branches):
        0.7204837069984933
        0.2300826290011173
        0.7204179290001775
        0.23008121000020765
    
    Clang 16.0.6 (i.e. without conditional branches):
        0.31062674399800017
        0.3103496809999342
        0.31044369300070684
        0.3100759939989075
    

    As expected, Clang produces branch-less assembly code which is faster than GCC's version using conditional branches when they are mispredicted, but not otherwise. It does not suffer from the observed performance issue.

    Note that Clang is the default compiler on macOS, so users running on macOS should not observe this problem.

    Please note that both implementations are sub-optimal for integers since they do not benefit from SIMD instructions (though this is not easy because there are no SIMD integer division instructions on mainstream CPUs). Not to mention that one could simply fill the output array with zeros in this very specific (unlikely) use-case.
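
    For completeness, here is what that last shortcut looks like on the user side for this very `% 1` case (a sketch of a workaround, not something Numpy does itself):

    import numpy as np

    a1 = np.random.randint(500, size=20_000_000, dtype=np.int32) - 250

    # For integer arrays, x % 1 is always 0, so the whole modulo can be skipped.
    result = np.zeros_like(a1)
    assert np.array_equal(result, a1 % 1)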