I am using compiler explorer to look at some outputs from gcc and clang to get an idea of what assembly these compilers emit for some code. Recently I looked at the output of this code.
int compare_int64(int64_t left, int64_t right)
{
return (left < right) ? -1 : (left > right) ? 1 : 0;
}
The point of this exercise is not for C++ where this code might be inlined anyway but when such functions are being called.
With -O3 this is the output:
clang:
xor ecx, ecx
cmp rdi, rsi
setg cl
mov eax, -1
cmovge eax, ecx
ret
gcc:
xor eax, eax
cmp rdi, rsi
mov edx, -1
setg al
cmovl eax, edx
ret
I noticed this code is 17 bytes in size which is just 1 byte over a nice 16 byte (the default code alignment for x64 in another non-C++ compiler I am using is 16). For the gcc code shown I was thinking of either using lea edx,[eax-1]
or or edx,-1
(latter before the cmp
of course) to reduce the code size. Interestingly when using -Os gcc inserts a jl
instruction which is kinda disastrous for the performance of that function.
I am no expert and looked into the instruction tables manual by Agner Fog and if I am not mistaken for mov
, lea
and or
the timings/latency are equal.
So the actual question(s):
Why do both compilers use a 5byte size instruction instead of a shorter 3- or 4-byte instruction?
Would it be harmless to replace the mov reg,-1
with lea reg,[eax-1]
or or reg,-1
?
When optimizing for speed mov reg, -1
is used instead of or reg, -1
because the former uses the register as a "write-only" operand, which CPU knows about and uses that to schedule it efficiently (out of order). Whereas or reg, -1
, even though will always produce -1
is not recognized by the CPU as a dependency-breaking (write only) instruction.
To illustrate how it can affect performance:
mov eax, [rdi] # Imagine a cache-miss here.
mov [rsi], eax
mov eax, -1 # `mov eax, -1` is able to dispatch and execute without waiting
# for the cache-miss to be served.
add eax, edx # `add eax, edx` only needs to wait 1 cycle for `mov` to
# complete (assuming `edx` is ready) and then it can
# dispatch while cache-miss load from a few lines above
# is still in progress.
And now this code:
mov eax, [rdi] # Imagine a cache-miss here.
mov [rsi], eax
or eax, -1 # Now this instruction has to wait for the cache-miss
# load to complete.
add eax, edx # And this one will be waiting too.
(Example applies for any current x86-64 CPU, such as Skylake/Ice Lake/Zen).
If you're writing the code in assembly and certain that a register is not part of a dependency-chain that's currently in progress, you can use or reg, -1
and it'll have no negative effect (if your assumptions are right, of course).
Because of that danger of accidentally attaching to a dependency chain compilers do not generally use or reg, -1
for producing -1 when optimizing for speed.
When we need a zero, instead of -1, we're in luck because there are idioms that CPUs recognize, for example xor reg, reg
and sub reg, reg
. They're smaller in code size, and CPUs recognize that the result of the computation does not depend on the register (always zero).
These zero-idioms, on top of being smaller in code size, are also usually processed by the front-end part of the CPU, so the instructions depending on the result will immediately be able to dispatch.
Zero-idioms also work for vector registers: vpxor xmm0, xmm0, xmm0
(produce zero with no dependency on previous value of xmm0 and zero-latency). What is interesting is that vector registers also have a -1 idiom, which is vpcmpeqd xmm0, xmm0, xmm0
- this one is recognized as write-only (comparing the value with itself will always be true), but it still has to execute (so it has latency=1), at least on SKL/Zen CPUs.
More about producing zeros: What is the best way to set a register to zero in x86 assembly: xor, mov or and?
Specifically which idioms are recognized can be found in Agner Fog's manuals or CPU optimization guides. TLDR is general purpose registers only have zero-idioms, vector registers have zero-idioms and all-ones idioms.
See also: Set all bits in CPU register to 1 efficiently (mentions lea edx, [rax - 1]
).
Note on the actual function. As you can see from assembly most of the work is actually trying to produce the specific constants that you've requested.
If all you intend to do with -1,0,1 is branch on whether it's negative/zero/positive, then it would be better to produce left - right
(as long as you ensure there is no overflow because that would make the subtraction result alone insufficient for comparison - in that case just use -1, 0, 1) and then just branch/cmov on that.