Search code examples
assemblyoptimizationx86micro-optimization

Check if ax is divisible by 16


How can we check if ax is divisible by 16?
I know we can by this command:

AND 0x000f

There is a faster command? (I think idiv is slower)


Solution

  • Yes, and is one of the fastest instructions, same throughput and latency as instructions like add, much faster than idiv1. (https://uops.info/, https://agner.org/optimize/, Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? (because the compiler used a shift instead of DIV.)


    If you're going to branch on it, test al, 0x0f / jnz not_multiple_of_16 is even faster on some CPUs (like AMD including Zen, or Intel Nehalem and earlier) that can macro-fuse TEST with JCC but not AND/JCC. TEST is like AND but only sets FLAGS without writing the destination. (So later reads of AX / EAX won't have the extra latency of AND in their path, and will be able to read the original value.)

    Also, test al, imm8 is only 2 bytes, saving code-size vs. 3-byte and eax, 0xf or 5-byte test eax, 0xf2. (I'm assuming 32-bit or 64-bit mode; I didn't notice the AX in the title until after writing this, which hints that you might be optimizing for 16-bit mode. Not a significant difference overall.)

    If you want to modify the register value itself to actually modulo by 16, then yes use and eax, 0xf. (Not and al, 0xf, you need to zero high bytes). Otherwise leave EAX unmodified and just write FLAGS.

    Sandybridge-family can macro-fuse AND/JCC, but and al, 0xf would write AL, introducing partial-register stalls if you wanted to read EAX later on P6-family CPUs (Nehalem and earlier). (Why doesn't GCC use partial registers?). On SnB, RMW operations don't rename AL separately from EAX, so there's no later merging needed there, and it's not a false dependency because you explicitly want to test

    If you want a 0 / 1 result in another register, then test al, 0xf / setnz cl is good.

      xor   ecx, ecx
      test  al, 0xf
      setnz cl              ; ECX =  bool(x % 16U)
    

    (If you're using setnz or cmovnz, there's no advantage to test over and, except for leaving EAX unmodified. Macro-fusion is only between test and conditional-branches, not setnz. So if you want the modify the register as well, as creating a boolean, you can use and instead of test here.)


    Footnote 1: idiv is much slower, and would need extra instructions to set up EDX, and put the divisor in another register, and doesn't set FLAGS according to the result in a useful way. In fact, div and idiv are the slowest integer math instructions on most CPUs, with only other heavily-microcoded instructions like int 0x80 or syscall, or for example a big rep movsb being slower.

    Recent CPUs like Broadwell and later have fairly pretty high-performance HW division (https://uops.info/, https://agner.org/optimize/), and apparently Ice Lake improves it some more (especially for 64-bit operand-size), but compilers try hard to avoid division. e.g. compilers will use multiple other instructions to implement x / 10, even for signed int x where that takes not just a multiplicative inverse for division but also some sign-bit handling to round towards 0 even for negative numbers. Why does GCC use multiplication by a strange number in implementing integer division?

    Of course for power-of-2 divisors, compilers know they can use AND.

    unsigned mod16(unsigned x) {
        return x % 16;
    }
    
    int mod16_signed(int x){
        return x % 16;            // negative for negative x, can't just use AND
    }
    

    asm on the Godbolt compiler explorer. Compiled with clang and GCC -O3 -m32 -mregparm=3 (so the first arg arrives in EAX, and is returned in EAX.)

    # GCC and clang of course do this
    mod16:
            and     eax, 15
            ret
    

    But signed is harder:

    # clang12.0 -O3 -m32 -mregparm=3
    mod16_signed:
            lea     ecx, [eax + 15]
            test    eax, eax                 # set FLAGS from x
            cmovns  ecx, eax                 # ecx = !(x<0) ? x : x+15
            and     ecx, -16                 # round ecx down to a multiple of 16
            sub     eax, ecx                 # return  x - round_down(ecx)
            ret
    

    Clang's has some instruction-level parallelism (LEA and TEST can run in parallel) and is probably best on modern CPUs with single-uop cmov (AMD, and Intel Broadwell and later). GCC's has all 5 instructions dependent on the previous, so 5 cycle latency, vs. 4 for clang's. Or also 5 on CPUs with 2-uop cmov.

    GCC uses a somewhat different strategy, using cdq to broadcast the sign bit into EDX, then a right shift by 28 to get 0 or 0xf in EDX.

    # gcc10.3 -O3 -m32 -mregparm=3
    mod16_signed:
            cdq
            shr     edx, 28             # edx = (x>=0) ? 0 : 0xf
            add     eax, edx            # eax = (x>=0) ? x : x+15
            and     eax, 15
            sub     eax, edx            # if(x<0) eax++
            ret
    

    Footnote 2: TEST doesn't have a test r/m32, sign_extended_imm8 form, unlike all other immediate instructions that were part of original 8086. It would rarely have been useful in original 8086, only for cases where you have the MSB set (so the upper half is all-ones) so you want to test if any bits are set except some in the low 8. e.g. test ax, -8 is like checking that ax >> 3 is non-zero.

    x & 0 = 0, so test al, 1 is always the same as test ax, 1; it doesn't write the destination so only the FLAGS result matters. And you can do test ah, imm8 if you want, or test ax, imm16 was only 1 extra byte in 16-bit mode so it wasn't a lot of missed savings when 8086 was designed. It's a 3-byte difference for 32-bit operand-size, but modern CPUs often don't bottleneck on code fetch.

    (Smaller is generally better to reduce overall L1i cache misses and usually pack better into the uop cache, and overall smaller binaries load from disk faster and have fewer iTLB misses, so compilers should favour smaller code when all else is equal. Often not worth using slower instructions to save code size, and still worth unrolling hot loops a bit.)