c++ math x86 micro-optimization integer-division

Most efficient way to find the greatest number not greater than A, which is divisible by B

I have two numbers, A and B. I want to find C = A - (A % B), but there are two complications. First, C and D = A / B should have the same parity (both even or both odd); otherwise C should be incremented (++C). Second, I do this calculation constantly, so I want it to cost as little as possible. Right now my solution looks like this:

uint32_t D = A / B;
C = D * B;
if ((C ^ D) & 0x1) ++C;

Is there a better way to do this? Maybe (C % 2) != (D % 2) is faster because of compiler optimizations, but I can't prove it. I would also like to know whether it can be done with some Intel-specific functions (or registers).

Solution

I assume the inputs A and B are also uint32_t?

The cost of the division dwarfs everything else, unless B is known at compile time after inlining (even if it's not a power of 2). The actual div instruction is very expensive compared to anything else here, and it can't be vectorized with SIMD. (The only SIMD division available on x86 is FP, or of course integer shifts for division by powers of 2.)

By far the most useful thing you could do is arrange for B's value to be visible to the compiler at compile time, or at least with link-time optimization for cross-file inlining. (Why does GCC use multiplication by a strange number in implementing integer division?)
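One way to make B visible to the compiler is to pass it as a template parameter. The sketch below is only an illustration: the names round_down_to_multiple and round_down_dispatch are hypothetical, and the divisors 3 and 10 are made-up examples. The dispatch wrapper assumes only a handful of divisors occur in practice; whether it beats a plain runtime div depends on the workload and should be measured.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: B is a template parameter, so the divisor is a
// compile-time constant and the compiler is free to avoid a div instruction.
template <uint32_t B>
inline uint32_t round_down_to_multiple(uint32_t A) {
    static_assert(B != 0, "divisor must be non-zero");
    return A - A % B;
}

// Hypothetical dispatch for a caller that only knows B at run time but
// expects a few common values (3 and 10 are placeholders).
inline uint32_t round_down_dispatch(uint32_t A, uint32_t B) {
    assert(B != 0);
    switch (B) {
        case 3:  return round_down_to_multiple<3>(A);
        case 10: return round_down_to_multiple<10>(A);
        default: return A - A % B;  // runtime divisor: falls back to real division
    }
}
```

Comparing the asm for round_down_to_multiple<10> against the default path shows whether your compiler actually removed the division.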

If B isn't a compile-time constant, x86 division will produce the remainder for free, along with the quotient. sub is cheaper than imul, so use % and let the compiler optimize:

uint32_t D = A / B;
uint32_t C = A - A % B;

And if B is a compile-time constant, the compiler will turn A % B into a divide followed by a multiply (and a subtract) anyway, and (hopefully) optimize this down to something as good as you'd get with your original.
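As a self-contained sketch of the runtime-divisor case, assuming A and B are uint32_t and B is non-zero (the struct and function names are hypothetical):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: D is the quotient, C the largest multiple of B
// that is not greater than A.
struct QuotMultiple {
    uint32_t D;
    uint32_t C;
};

inline QuotMultiple quot_and_multiple(uint32_t A, uint32_t B) {
    assert(B != 0);              // division by zero is undefined behaviour
    uint32_t D = A / B;          // quotient
    uint32_t C = A - A % B;      // same value as D * B, written with the remainder
    return {D, C};
}
```

Both / and % use the same operands, so a compiler can normally compute them with a single division; check the generated asm to confirm.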

And no, (C^D) & 1 should be a more efficient way to check that the low bits differ than (C % 2) != (D % 2). Doing something separate to each input before combining would cost more instructions, so it's better to lead the compiler in the direction of the more efficient asm implementation. (Obviously it's a good idea to have a look at the asm output for both cases.)

It might also be useful to use + instead of ^. XOR is addition without carry, but you only care about the low bit, and the low bit of ^ and + is always the same. This gives the compiler the option of using an lea instruction to copy-and-add. (Probably not helpful in this case; it's ok if the compiler destroys the value in the register holding D, assuming it's dead after this. But if you also use D directly afterwards, a non-destructive copy-and-add could save a mov.)
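The claim that ^ and + agree in the low bit can be checked directly. This is a small sanity-check sketch with hypothetical helper names, comparing both forms against the (C % 2) != (D % 2) version from the question:

```cpp
#include <cassert>
#include <cstdint>

// Three ways to ask "do c and d have different parity?"
constexpr bool differ_xor(uint32_t c, uint32_t d) { return ((c ^ d) & 1u) != 0; }
constexpr bool differ_add(uint32_t c, uint32_t d) { return ((c + d) & 1u) != 0; }
constexpr bool differ_mod(uint32_t c, uint32_t d) { return (c % 2) != (d % 2); }

// Unsigned wrap-around in c + d only discards high bits, so the low bit is
// still correct even when the sum overflows.
static_assert(!differ_add(0xFFFFFFFFu, 1u), "odd + odd has an even low bit");

// Brute-force check over a small range of values.
inline void check_parity_forms_agree() {
    for (uint32_t c = 0; c < 64; ++c) {
        for (uint32_t d = 0; d < 64; ++d) {
            assert(differ_xor(c, d) == differ_mod(c, d));
            assert(differ_add(c, d) == differ_mod(c, d));
        }
    }
}
```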

Of course, you don't actually want to branch with if(...) so you should write it as:

C += (C+D) & 1;       // +1 if low bits differ
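Putting the pieces together, a complete function might look like the sketch below. The function names are hypothetical, and it assumes uint32_t inputs with B non-zero. The second function wraps the code from the question so the two can be compared.

```cpp
#include <cassert>
#include <cstdint>

// Branchless version: largest multiple of B not above A, bumped by one
// when its parity differs from that of the quotient.
inline uint32_t adjusted_multiple(uint32_t A, uint32_t B) {
    assert(B != 0);
    uint32_t D = A / B;
    uint32_t C = A - A % B;   // remainder comes from the same division
    C += (C + D) & 1u;        // +1 if low bits differ
    return C;
}

// The question's original formulation, kept for comparison.
inline uint32_t adjusted_multiple_branchy(uint32_t A, uint32_t B) {
    assert(B != 0);
    uint32_t D = A / B;
    uint32_t C = D * B;
    if ((C ^ D) & 0x1) ++C;
    return C;
}
```

For example, with A = 57 and B = 10, D is 5 and C starts at 50; the parities differ, so both versions return 51. A compiler may well turn the if into branchless code on its own, so compare the asm for both before assuming one is faster.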