Search code examples
assemblyarmcpu-architecturemicro-optimization

AND faster than integer modulo operation?


It is possible to re-express:

  • i % m

as:

  • i & (m-1)

where,

  • i is an unsigned integer
  • m is a power of 2

My question is: is the AND operation any faster? Don't modern CPUs support integer modulo in hardware in a single instruction? I'm interested in ARM, but don't see the modulo operation in its instruction set.


Solution

  • It's more complicated than "single instruction" these days. Modern CPUs are complex beasts and need their instructions broken down into issue/execute/latency. It also usually depends on the width of the divide/modulo - how many bits are involved.

    In any case, I'm not aware of 32 bit division being single cycle latency on any core, ARM or not. On "modern" ARM there are integer divide instructions, but only on some implementations, and most notably not on the most common ones - Cortex A8 and A9.

    In some cases, the compiler can save you the trouble of converting a divide/modulo into bit shift/mask operations. However, this is only possible if the value is known at compile time. In your case, if the compiler can see for sure that 'm' is always a power a two, then it'll optimize it to bit ops, but if it's a variable passed into a function (or otherwise computed), then it can't, and will resort to a full divide/modulo. This kind of code construction often works (but not always - depends how smart your optimizer is):

    unsigned page_size_bits = 12;     // optimization works even without const here
    
    unsigned foo(unsigned address) {
      unsigned page_size = 1U << page_size_bits;
      return address / page_size;
    }
    

    The trick is to let the compiler know that the "page_size" is a power of two. I know that gcc and variants will special-case this, but I'm not sure about other compilers.

    As a rule of thumb for any core - ARM or not (even x86), prefer bit shift/mask to divide/modulo, especially for anything that isn't a compile-time constant. Even if your core has hardware divide, it'll be faster to do it manually.

    (Also, signed division has to truncate towards 0, and div / remainder have be able to produce negative numbers, so even x % 4 is more expensive than x & 3 for signed int x.)