Disassembling write(1,"hi",3) on Linux, built with gcc -s -nostdlib -nostartfiles -O3, results in:
ba03000000 mov edx, 3 ; thanks for the correction jester!
bf01000000 mov edi, 1
31c0 xor eax, eax
e9d8ffffff jmp loc.imp.write
I'm not into compiler development, but since every value moved into these registers is constant and known at compile time, I'm curious why gcc doesn't use dl, dil, and al instead.
Some may argue that this won't make any difference in performance, but there's a big difference in executable size between mov $1, %eax => b801000000 and mov $1, %al => b001 when we are talking about thousands of register accesses in a program. Not only is small size part of a software's elegance, it also affects performance.
Can someone explain why "GCC decided" that it doesn't matter?
Yes, GCC generally avoids writing to partial registers, unless optimizing for size (-Os) instead of purely speed (-O3). Some cases require writing at least the 32-bit register for correctness, so a better example would be something like:
char foo(char *p) { return *p; }
compiles to movzx eax, byte ptr [rdi] instead of mov al, [rdi]. https://godbolt.org/z/4ca9cTG9j
But GCC doesn't always avoid partial registers, sometimes even causing partial-register stalls https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533
Writing partial registers entails a performance penalty on many x86 processors because they are renamed into different physical registers from their whole counterpart when written. (For more about register renaming enabling out-of-order execution, see this Q&A).
But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)
It's called a partial register stall. Agner Fog's microarchitecture manual explains it pretty well:
6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)
Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
Example:
; Example 6.10a. Partial register stall
mov al, byte ptr [mem8]
mov ebx, eax ; Partial register stall
This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.
Behaviour in different CPUs:
Without partial-register renaming, the input dependency for the write is a false dependency if you never read the full register. This limits instruction-level parallelism because reusing an 8 or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves). It also means AL and AH are not independent. When Intel designed P6-family (PPro released in 1995), 16-bit code was still common, so partial-register renaming was an important feature to make existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)
That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.
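The contrast between a partial write and a movzx can be sketched in plain C. This is only a model under stated assumptions: the function names mov_al, movzx_eax, and partial_write_demo are hypothetical, and the code mimics the architectural result of each instruction rather than any real CPU's renaming logic. The point is visible in the signatures: the partial write needs the old register value as an input, while movzx does not.

```c
#include <assert.h>
#include <stdint.h>

/* Models `mov al, src`: only the low byte changes, so the result
   depends on the previous value of EAX (an input dependency). */
uint32_t mov_al(uint32_t old_eax, uint8_t src)
{
    return (old_eax & 0xFFFFFF00u) | src;
}

/* Models `movzx eax, src`: the whole register is overwritten, so the
   previous value of EAX is not an input at all. */
uint32_t movzx_eax(uint8_t src)
{
    return (uint32_t)src;
}

/* Hypothetical self-check for the two models above. */
void partial_write_demo(void)
{
    uint32_t eax = 0x12345678u;
    assert(mov_al(eax, 0xAB) == 0x123456ABu); /* upper bytes preserved */
    assert(movzx_eax(0xAB) == 0x000000ABu);   /* upper bytes zeroed */
}
```

In Agner Fog's Example 6.10a, the later mov ebx, eax has to see the merged value that mov_al models, which is why the hardware either stalls or carries a dependency on the old EAX.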
As @BeeOnRope's answer points out, compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)
Also note that write takes arguments that, for a typically configured x86-64 GCC, need whole 32-bit and 64-bit registers, so the call couldn't simply be assembled into mov dl, 3. The size is determined by the type of the data, not the value of the data.
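As a rough illustration of "type, not value", the sketch below uses a hypothetical stand-in, arg_width_bytes, that has the same parameter types as POSIX write() but performs no I/O. The register names in the comments assume the x86-64 System V calling convention; on other ABIs the widths and registers may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with the same parameter types as POSIX write().
   It reports how many bytes each parameter occupies, regardless of the
   value that was passed. */
size_t arg_width_bytes(int fd, const void *buf, size_t count, int which)
{
    switch (which) {
    case 0:  return sizeof fd;    /* int: edi under the assumed ABI */
    case 1:  return sizeof buf;   /* pointer: rsi under the assumed ABI */
    default: return sizeof count; /* size_t: rdx under the assumed ABI */
    }
}

/* Hypothetical self-check: the count 3 would fit in one byte, yet the
   parameter is still as wide as size_t. */
void arg_width_demo(void)
{
    assert(arg_width_bytes(1, "hi", 3, 2) == sizeof(size_t));
    assert(arg_width_bytes(1, "hi", 3, 0) == sizeof(int));
}
```

The callee is entitled to read the full width of each parameter, so the caller has to set the full register.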
Only 32-bit register writes implicitly zero-extend to the full 64 bits; writes to 8- and 16-bit partial registers leave the upper bytes unchanged. (Preserving the upper bytes is tricky for hardware to handle efficiently, which is why AMD64 didn't follow that pattern for 32-bit writes.)
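The three write widths can be modeled in C as follows. The names write8, write16, write32, and write_width_demo are hypothetical, and the functions only reproduce the architectural result on a 64-bit register value; they say nothing about timing.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit write (e.g. to al): bits 63..8 of the old value are kept. */
uint64_t write8(uint64_t old, uint8_t v)
{
    return (old & ~UINT64_C(0xFF)) | v;
}

/* 16-bit write (e.g. to ax): bits 63..16 of the old value are kept. */
uint64_t write16(uint64_t old, uint16_t v)
{
    return (old & ~UINT64_C(0xFFFF)) | v;
}

/* 32-bit write (e.g. to eax): the upper 32 bits become zero, so the
   old value does not contribute to the result. */
uint64_t write32(uint64_t old, uint32_t v)
{
    (void)old; /* intentionally unused */
    return (uint64_t)v;
}

/* Hypothetical self-check for the three models above. */
void write_width_demo(void)
{
    uint64_t rax = UINT64_C(0x1122334455667788);
    assert(write8(rax, 0xAA) == UINT64_C(0x11223344556677AA));
    assert(write16(rax, 0xBBBB) == UINT64_C(0x112233445566BBBB));
    assert(write32(rax, 0xCCCCCCCCu) == UINT64_C(0x00000000CCCCCCCC));
}
```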
Finally, in certain contexts, C has default argument promotions to be aware of, though that is not the case here.
Actually, as RossRidge pointed out, the call was probably made without a visible prototype.
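For readers unfamiliar with default argument promotions, here is a small sketch. It uses a variadic function rather than an unprototyped call, on the assumption that the same promotions apply to both and because variadic functions work in every C version. The names first_variadic_as_int and promotion_demo are hypothetical.

```c
#include <assert.h>
#include <stdarg.h>

/* Variadic arguments undergo the default argument promotions, so a char
   argument arrives as an int and is read back with va_arg(ap, int). */
int first_variadic_as_int(int count, ...)
{
    va_list ap;
    va_start(ap, count);
    int value = va_arg(ap, int);
    va_end(ap);
    return value;
}

/* Hypothetical self-check. */
void promotion_demo(void)
{
    char c = 3;
    assert(first_variadic_as_int(1, c) == 3);
    /* Unary plus applies the integer promotions, so the promoted
       expression has the size of int, not of char. */
    assert(sizeof(+c) == sizeof(int));
}
```

So even a char-sized argument would be widened to at least int before being passed in such a call.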
Your disassembly is misleading, as @Jester pointed out.
For example, mov rdx, 3 is actually mov edx, 3, although both have the same effect: putting 3 in the whole rdx.
This is true because an immediate value of 3 doesn't require sign-extension and a MOV r32, imm32 implicitly clears the upper 32 bits of the register.
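To see why the two forms agree for 3 but not for every immediate, here is a minimal C model. The function names are hypothetical. mov_r32_imm32 models the 5-byte encoding shown in the question (ba 03 00 00 00), and mov_r64_simm32 models the longer REX.W C7 /0 form, which sign-extends its 32-bit immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Models `mov r32, imm32`: the 32-bit result is zero-extended to 64 bits. */
uint64_t mov_r32_imm32(uint32_t imm)
{
    return (uint64_t)imm;
}

/* Models `mov r64, imm32` (REX.W C7 /0): the immediate is sign-extended. */
uint64_t mov_r64_simm32(int32_t imm)
{
    return (uint64_t)(int64_t)imm;
}

/* Hypothetical self-check for the two models above. */
void immediate_demo(void)
{
    /* Small non-negative immediates: both forms give the same value. */
    assert(mov_r32_imm32(3) == mov_r64_simm32(3));
    /* With bit 31 set, the results differ in the upper 32 bits. */
    assert(mov_r32_imm32(0x80000000u) == UINT64_C(0x0000000080000000));
    assert(mov_r64_simm32(INT32_MIN) == UINT64_C(0xFFFFFFFF80000000));
}
```

Because the results match for 3, an assembler or compiler is free to pick the shorter 32-bit encoding.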