assembly gcc clang x86-64 micro-optimization

Why do gcc and clang generate mov reg,-1

I am using compiler explorer to look at some outputs from gcc and clang to get an idea of what assembly these compilers emit for some code. Recently I looked at the output of this code.

int compare_int64(int64_t left, int64_t right)
{
    return (left < right) ? -1 : (left > right) ? 1 : 0;
}

The point of this exercise is not for C++ where this code might be inlined anyway but when such functions are being called.

With -O3 this is the output:

clang:

xor     ecx, ecx
cmp     rdi, rsi
setg    cl
mov     eax, -1
cmovge  eax, ecx
ret

gcc:

xor     eax, eax
cmp     rdi, rsi
mov     edx, -1
setg    al
cmovl   eax, edx
ret

I noticed this code is 17 bytes in size which is just 1 byte over a nice 16 byte (the default code alignment for x64 in another non-C++ compiler I am using is 16). For the gcc code shown I was thinking of either using lea edx,[eax-1] or or edx,-1 (latter before the cmp of course) to reduce the code size. Interestingly when using -Os gcc inserts a jl instruction which is kinda disastrous for the performance of that function.

I am no expert and looked into the instruction tables manual by Agner Fog and if I am not mistaken for mov, lea and or the timings/latency are equal.

So the actual question(s): Why do both compilers use a 5byte size instruction instead of a shorter 3- or 4-byte instruction? Would it be harmless to replace the mov reg,-1 with lea reg,[eax-1] or or reg,-1?

Solution

When optimizing for speed mov reg, -1 is used instead of or reg, -1 because the former uses the register as a "write-only" operand, which CPU knows about and uses that to schedule it efficiently (out of order). Whereas or reg, -1, even though will always produce -1 is not recognized by the CPU as a dependency-breaking (write only) instruction.

To illustrate how it can affect performance:

mov eax, [rdi]  # Imagine a cache-miss here.
mov [rsi], eax
mov eax, -1     # `mov eax, -1` is able to dispatch and execute without waiting
                # for the cache-miss to be served.
add eax, edx    # `add eax, edx` only needs to wait 1 cycle for `mov` to
                # complete (assuming `edx` is ready) and then it can
                # dispatch while cache-miss load from a few lines above
                # is still in progress.

And now this code:

mov eax, [rdi]   # Imagine a cache-miss here.
mov [rsi], eax
or eax, -1       # Now this instruction has to wait for the cache-miss
                 # load to complete.
add eax, edx     # And this one will be waiting too.

(Example applies for any current x86-64 CPU, such as Skylake/Ice Lake/Zen).

If you're writing the code in assembly and certain that a register is not part of a dependency-chain that's currently in progress, you can use or reg, -1 and it'll have no negative effect (if your assumptions are right, of course).

Because of that danger of accidentally attaching to a dependency chain compilers do not generally use or reg, -1 for producing -1 when optimizing for speed.

When we need a zero, instead of -1, we're in luck because there are idioms that CPUs recognize, for example xor reg, reg and sub reg, reg. They're smaller in code size, and CPUs recognize that the result of the computation does not depend on the register (always zero).

These zero-idioms, on top of being smaller in code size, are also usually processed by the front-end part of the CPU, so the instructions depending on the result will immediately be able to dispatch.

Zero-idioms also work for vector registers: vpxor xmm0, xmm0, xmm0 (produce zero with no dependency on previous value of xmm0 and zero-latency). What is interesting is that vector registers also have a -1 idiom, which is vpcmpeqd xmm0, xmm0, xmm0 - this one is recognized as write-only (comparing the value with itself will always be true), but it still has to execute (so it has latency=1), at least on SKL/Zen CPUs.

More about producing zeros: What is the best way to set a register to zero in x86 assembly: xor, mov or and?

Specifically which idioms are recognized can be found in Agner Fog's manuals or CPU optimization guides. TLDR is general purpose registers only have zero-idioms, vector registers have zero-idioms and all-ones idioms.

See also: Set all bits in CPU register to 1 efficiently (mentions lea edx, [rax - 1]).

Note on the actual function. As you can see from assembly most of the work is actually trying to produce the specific constants that you've requested.

If all you intend to do with -1,0,1 is branch on whether it's negative/zero/positive, then it would be better to produce left - right (as long as you ensure there is no overflow because that would make the subtraction result alone insufficient for comparison - in that case just use -1, 0, 1) and then just branch/cmov on that.