c++assembly optimization x86-64 integer-division

Repeated integer division by a runtime constant value

At some point in my program I compute an integer divisor d. From that point onward d is going to be constant.

Later in the code I will divide by that d several times - performing an integer division, since the value of d is not a compile-time known constant.

Given that integer division is a relatively slow process compared to other kind of integer arithmetic, I would like to optimize it. Is there some alternative format that I could store d in, so that the division process would perform faster? Maybe a reciprocal of some form?

I do not need the value of d for anything else.

The value of d is any 64-bit integer, but usually fits in 32-bit quite well.

Solution

There is a library for this—libdivide:

libdivide is an open source library for optimizing integer division

libdivide allows you to replace expensive integer divides with comparatively cheap multiplication and bitshifts. Compilers usually do this, but only when the divisor is known at compile time. libdivide allows you to take advantage of it at runtime. The result is that integer division can become faster - a lot faster. Furthermore, libdivide allows you to divide an SSE2 vector by a runtime constant, which is especially nice because SSE2 has no integer division instructions!

libdivide is free and open source with a permissive license. The name "libdivide" is a bit of a joke, as there is no library per se: the code is packaged entirely as a single header file, with both a C and a C++ API.

You can read about the algorithm behind it at this blog; for example, this entry.

Basically, the algorithm behind it is the same one that compilers use to optimize division by a constant, except that it allows these strength-reduction optimizations to be done at run-time.

Note: you can create an even faster version of libdivide. The idea is that for every divisor, you can always create a triplet (mul/add/shift), so this expression gives the result: (num*mul+add)>>shift (multiply is a wide multiply here). Interestingly, this method could beat the compiler version for constant division for several microbenchmarks!

Here's my implementation (this is not compilable out of the box, but the general algorithm can be seen):

struct Divider_u32 {
    u32 mul;
    u32 add;
    s32 shift;

    void set(u32 divider);
};

void Divider_u32::set(u32 divider) {
    s32 l = indexOfMostSignificantBit(divider);
    if (divider&(divider-1)) {
        u64 m = static_cast<u64>(1)<<(l+32);
        mul = static_cast<u32>(m/divider);

        u32 rem = static_cast<u32>(m)-mul*divider;
        u32 e = divider-rem;

        if (e<static_cast<u32>(1)<<l) {
            mul++;
            add = 0;
        } else {
            add = mul;
        }
        shift = l;
    } else {
        if (divider==1) {
            mul = 0xffffffff;
            add = 0xffffffff;
            shift = 0;
        } else {
            mul = 0x80000000;
            add = 0;
            shift = l-1;
        }
    }
}

u32 operator/(u32 v, const Divider_u32 &div) {
    u32 t = static_cast<u32>((static_cast<u64>(v)*div.mul+div.add)>>32)>>div.shift;

    return t;
}