c assembly optimization integer-division

Computing 2¹²⁸ % uint64_t

How to efficiently compute 2¹²⁸ % n, where n is uint64_t (and non-zero)?

If I had access to a narrowing division like _udiv128 intrinsic of MSVC I could do:

uint64_t remainder;
_udiv128(1, 0, n, &remainder);
_udiv128(remainder, 0, n, &remainder);

But that's two slow instructions, not to mention there are CPUs like ARM which do not have the narrowing division. Is there a better way?

Related: How to compute 2⁶⁴/n in C?

Solution

The number of narrowing divisions can be reduced to 1 with a simple trick:

uint64_t remainder;
_udiv128(-n % n, 0, n, &remainder);

-n % n still costs a 64-bit division, but a division where the high half of the dividend is zero tends to be more efficient (certainly more efficient than the "full" libdivide_128_div_64_to_64, using its slow path, which costs two 64-bit divisions plus a bunch of additional stuff). Some processors don't care as much about a non-zero upper half of the dividend though.

This trick works because 2⁶⁴ is definitely greater than n, so we can subtract n from it once, then we get 2⁶⁴-n which is equal to -n computed modulo 2⁶⁴ (arithmetic on uint64_t is done modulo 2⁶⁴). We can't just put -n into the high part of the dividend because then the quotient becomes too large and the division faults, but a dedicated 128-bit by 64-bit remainder operation (as opposed to a division that also produces a remainder as a by-product) could support that. So -n % n it is.

The _udiv128 intrinsic could be replaced with a call to libdivide_128_div_64_to_64 to support other platforms.

It seems plausible to me (given the special nature of this case) that there are more tricks, but I don't know them and couldn't find them. A common recommendation that I encountered was to use exponentiation by squaring, but if taken literally that would first take 6 steps just to get to -n % n (we may as well start with that and then do only one square), and doesn't solve the biggest problem: how to do what _udiv128 does without _udiv128. Squaring -n % n and then reducing it mod n doesn't seem any better to me than multiplying -n % n by 2⁶⁴ and reducing it mod n, it just costs an extra multiplication (with 128-bit output) and then leaves us with an equally annoying problem.