How can we check if ax is divisible by 16?
I know we can by this command:
AND 0x000f
There is a faster command? (I think idiv
is slower)
Yes, and
is one of the fastest instructions, same throughput and latency as instructions like add
, much faster than idiv
1. (https://uops.info/, https://agner.org/optimize/, Why does C++ code for testing the Collatz conjecture run faster than hand-written assembly? (because the compiler used a shift instead of DIV.)
If you're going to branch on it, test al, 0x0f
/ jnz not_multiple_of_16
is even faster on some CPUs (like AMD including Zen, or Intel Nehalem and earlier) that can macro-fuse TEST with JCC but not AND/JCC. TEST is like AND but only sets FLAGS without writing the destination. (So later reads of AX / EAX won't have the extra latency of AND in their path, and will be able to read the original value.)
Also, test al, imm8
is only 2 bytes, saving code-size vs. 3-byte and eax, 0xf
or 5-byte test eax, 0xf
2. (I'm assuming 32-bit or 64-bit mode; I didn't notice the AX in the title until after writing this, which hints that you might be optimizing for 16-bit mode. Not a significant difference overall.)
If you want to modify the register value itself to actually modulo by 16, then yes use and eax, 0xf
. (Not and al, 0xf
, you need to zero high bytes). Otherwise leave EAX unmodified and just write FLAGS.
Sandybridge-family can macro-fuse AND/JCC, but and al, 0xf
would write AL, introducing partial-register stalls if you wanted to read EAX later on P6-family CPUs (Nehalem and earlier). (Why doesn't GCC use partial registers?). On SnB, RMW operations don't rename AL separately from EAX, so there's no later merging needed there, and it's not a false dependency because you explicitly want to test
If you want a 0 / 1 result in another register, then test al, 0xf
/ setnz cl
is good.
xor ecx, ecx
test al, 0xf
setnz cl ; ECX = bool(x % 16U)
(If you're using setnz
or cmovnz
, there's no advantage to test
over and
, except for leaving EAX unmodified. Macro-fusion is only between test and conditional-branches, not setnz. So if you want the modify the register as well, as creating a boolean, you can use and
instead of test
here.)
Footnote 1: idiv
is much slower, and would need extra instructions to set up EDX, and put the divisor in another register, and doesn't set FLAGS according to the result in a useful way. In fact, div
and idiv
are the slowest integer math instructions on most CPUs, with only other heavily-microcoded instructions like int 0x80
or syscall
, or for example a big rep movsb
being slower.
Recent CPUs like Broadwell and later have fairly pretty high-performance HW division (https://uops.info/, https://agner.org/optimize/), and apparently Ice Lake improves it some more (especially for 64-bit operand-size), but compilers try hard to avoid division. e.g. compilers will use multiple other instructions to implement x / 10
, even for signed int x
where that takes not just a multiplicative inverse for division but also some sign-bit handling to round towards 0 even for negative numbers. Why does GCC use multiplication by a strange number in implementing integer division?
Of course for power-of-2 divisors, compilers know they can use AND.
unsigned mod16(unsigned x) {
return x % 16;
}
int mod16_signed(int x){
return x % 16; // negative for negative x, can't just use AND
}
asm on the Godbolt compiler explorer. Compiled with clang and GCC -O3 -m32 -mregparm=3
(so the first arg arrives in EAX, and is returned in EAX.)
# GCC and clang of course do this
mod16:
and eax, 15
ret
But signed is harder:
# clang12.0 -O3 -m32 -mregparm=3
mod16_signed:
lea ecx, [eax + 15]
test eax, eax # set FLAGS from x
cmovns ecx, eax # ecx = !(x<0) ? x : x+15
and ecx, -16 # round ecx down to a multiple of 16
sub eax, ecx # return x - round_down(ecx)
ret
Clang's has some instruction-level parallelism (LEA and TEST can run in parallel) and is probably best on modern CPUs with single-uop cmov
(AMD, and Intel Broadwell and later). GCC's has all 5 instructions dependent on the previous, so 5 cycle latency, vs. 4 for clang's. Or also 5 on CPUs with 2-uop cmov
.
GCC uses a somewhat different strategy, using cdq to broadcast the sign bit into EDX, then a right shift by 28 to get 0 or 0xf in EDX.
# gcc10.3 -O3 -m32 -mregparm=3
mod16_signed:
cdq
shr edx, 28 # edx = (x>=0) ? 0 : 0xf
add eax, edx # eax = (x>=0) ? x : x+15
and eax, 15
sub eax, edx # if(x<0) eax++
ret
Footnote 2: TEST doesn't have a test r/m32, sign_extended_imm8
form, unlike all other immediate instructions that were part of original 8086. It would rarely have been useful in original 8086, only for cases where you have the MSB set (so the upper half is all-ones) so you want to test if any bits are set except some in the low 8. e.g. test ax, -8
is like checking that ax >> 3
is non-zero.
x & 0 = 0,
so test al, 1
is always the same as test ax, 1
; it doesn't write the destination so only the FLAGS result matters. And you can do test ah, imm8
if you want, or test ax, imm16
was only 1 extra byte in 16-bit mode so it wasn't a lot of missed savings when 8086 was designed. It's a 3-byte difference for 32-bit operand-size, but modern CPUs often don't bottleneck on code fetch.
(Smaller is generally better to reduce overall L1i cache misses and usually pack better into the uop cache, and overall smaller binaries load from disk faster and have fewer iTLB misses, so compilers should favour smaller code when all else is equal. Often not worth using slower instructions to save code size, and still worth unrolling hot loops a bit.)