c++ c performance modulo integer-division

Fastest way to get a positive modulo in C/C++

Often in my inner loops I need to index an array in a "wrap-around" way, so that (for example) if the array size is 100 and my code asks for element -2, it should be given element 98. In many high-level languages such as Python, one can do this simply with my_array[index % array_size], but C's integer division rounds toward zero instead of consistently rounding down, and consequently its modulo operator returns a negative result when given a negative first argument.

Often I know that index will not be less than -array_size, and in these cases I just do my_array[(index + array_size) % array_size]. However, sometimes this can't be guaranteed, and for those cases I would like to know the fastest way to implement an always-positive modulo function. There are several "clever" ways to do it without branching, such as

inline int positive_modulo(int i, int n) {
    return (n + (i % n)) % n;
}

inline int positive_modulo(int i, int n) {
    int r = i % n;
    // Test the remainder, not i: a negative multiple of n must map to 0, not n.
    return r + (n * (r < 0));
}

Of course I can profile these to find out which is the fastest on my system, but I can't help worrying that I might have missed a better one, or that what's fast on my machine might be slow on a different one.

So is there a standard way to do this, or some clever trick that I've missed that's likely to be the fastest possible way?

Also, I know it's probably wishful thinking, but if there's a way of doing this that can be auto-vectorised, that would be amazing.

Solution

Most of the time, compilers are very good at optimizing your code, so it is usually best to keep your code readable, so that both the compiler and other developers can tell what you are doing.

Since your array size is always positive, I suggest defining the divisor as unsigned. The compiler will optimize small if/else blocks into conditional instructions, which involve no branches:

unsigned modulo( int value, unsigned m) {
    int mod = value % (int)m;
    if (mod < 0) {
        mod += m;
    }
    return mod;
}

This creates a very small function without branches:

modulo(int, unsigned int):
        mov     eax, edi
        cdq
        idiv    esi
        add     esi, edx
        mov     eax, edx
        test    edx, edx
        cmovs   eax, esi
        ret

For example modulo(-5, 7) returns 2.
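To tie this back to the wrap-around indexing in the question, here is a minimal sketch. It repeats modulo() so that it compiles on its own; wrapped_at is a hypothetical helper name that is not part of the original answer, and the assertions are only there to state the preconditions.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Same logic as modulo() above, repeated so this sketch is self-contained.
unsigned modulo(int value, unsigned m) {
    // Precondition: m is positive and fits in an int, so the cast is safe.
    assert(m > 0 && m <= static_cast<unsigned>(std::numeric_limits<int>::max()));
    int mod = value % static_cast<int>(m);
    if (mod < 0) {
        mod += static_cast<int>(m);
    }
    return static_cast<unsigned>(mod);
}

// Hypothetical helper: read data[index] with wrap-around indexing.
int wrapped_at(const std::vector<int>& data, int index) {
    assert(!data.empty());
    return data[modulo(index, static_cast<unsigned>(data.size()))];
}
```

With a 100-element vector, wrapped_at(v, -2) reads element 98, and wrapped_at(v, 100) reads element 0.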

Unfortunately, since the divisor is not known at compile time, the compiler must emit an integer division, which is slow compared to other integer operations. If you know that your array sizes are powers of two, I recommend keeping these function definitions in a header, so that the compiler can inline them and replace the division with cheaper instructions. Here is the function unsigned modulo256(int v) { return modulo(v,256); }:

modulo256(int):                          # @modulo256(int)
        mov     edx, edi
        sar     edx, 31
        shr     edx, 24
        lea     eax, [rdi+rdx]
        movzx   eax, al
        sub     eax, edx
        lea     edx, [rax+256]
        test    eax, eax
        cmovs   eax, edx
        ret

See assembly: https://gcc.godbolt.org/z/DG7jMw

See comparison with most voted answer: http://quick-bench.com/oJbVwLr9G5HJb0oRaYpQOCec4E4
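If the size is known to be a power of two, a hand-written alternative is to mask the low bits instead of taking a remainder. This is a different technique from the compiler output shown above, not something the compiler is claimed to emit. The name modulo_pow2 is hypothetical. The sketch relies on the fact that converting an int to unsigned is defined as reduction modulo 2^N, and 2^N is a multiple of any power-of-two m that fits in an unsigned.

```cpp
#include <cassert>

// Hypothetical helper: non-negative modulo for a power-of-two divisor.
unsigned modulo_pow2(int value, unsigned m) {
    // Precondition: m is a non-zero power of two (exactly one bit set).
    assert(m != 0 && (m & (m - 1)) == 0);
    // The int -> unsigned conversion wraps modulo 2^N, which preserves the
    // value modulo m, so keeping the low bits gives a result in [0, m).
    return static_cast<unsigned>(value) & (m - 1);
}
```

For example, modulo_pow2(-2, 256) gives 254, with no division and no sign handling.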

Edit: it turns out that Clang is able to generate a function without any conditional move instructions (which cost more than regular arithmetic operations). The difference is negligible in the general case, because the integer division takes around 70% of the total time.

Basically, Clang shifts mod right arithmetically so that its sign bit fills the whole width of m (giving 0xffffffff when mod is negative and 0 otherwise), and uses the result to mask the second operand in mod + m.

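#include <limits>

unsigned modulo(int value, unsigned m) {
    int mod = value % (int)m;
    // The shift yields all ones when mod is negative and 0 otherwise
    // (assumes an arithmetic right shift), so m is either kept or cleared.
    m &= mod >> std::numeric_limits<int>::digits;
    return mod + m;
}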
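As a sanity check, the masked version can be compared against the if-based version over a range of inputs. The sketch below is only illustrative: modulo_branch, modulo_mask and variants_agree are hypothetical names, and it assumes that right-shifting a negative int is an arithmetic shift (guaranteed since C++20, implementation-defined before that, though common compilers behave this way).

```cpp
#include <cassert>
#include <limits>

// Reference version with an explicit sign test.
unsigned modulo_branch(int value, unsigned m) {
    int mod = value % static_cast<int>(m);
    if (mod < 0) {
        mod += static_cast<int>(m);
    }
    return static_cast<unsigned>(mod);
}

// Masked version: adds m only when the remainder is negative.
unsigned modulo_mask(int value, unsigned m) {
    int mod = value % static_cast<int>(m);
    // Assumption: arithmetic right shift, so mask is all ones or zero.
    unsigned mask = static_cast<unsigned>(mod >> std::numeric_limits<int>::digits);
    // Unsigned addition wraps, so a negative mod plus m lands in [0, m).
    return static_cast<unsigned>(mod) + (m & mask);
}

// Returns true if both versions agree for every value in [first, last].
bool variants_agree(int first, int last, unsigned m) {
    assert(m > 0 && last < std::numeric_limits<int>::max());  // avoid ++v overflow
    for (int v = first; v <= last; ++v) {
        if (modulo_branch(v, m) != modulo_mask(v, m)) {
            return false;
        }
    }
    return true;
}
```

Calling variants_agree(-1000, 1000, 7) is a quick way to confirm that the two versions match on a given compiler before benchmarking them.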