Search code examples
assemblygccx86x86-64cpu-architecture

Why doesn't GCC use partial registers?


Disassembling write(1,"hi",3) on linux, built with gcc -s -nostdlib -nostartfiles -O3 results in:

ba03000000     mov edx, 3 ; thanks for the correction jester!
bf01000000     mov edi, 1
31c0           xor eax, eax
e9d8ffffff     jmp loc.imp.write

I'm not into compiler development but since every value moved into these registers are constant and known compile-time, I'm curious why doesn't gcc uses dl, dil, and al instead. Some may argue that this feature won't make any difference in performance but there's a big difference in executable size between mov $1, %rax => b801000000 and mov $1, %al => b001 when we are talking about thousands of register accesses in a program. Not only small size if part of a software's elegance, it does have effect on performance.

Can someone explain why did "GCC decide" that it doesn't matter?


Solution

  • Yes, GCC generally avoids writing to partial registers, unless optimizing for size (-Os) instead of purely speed (-O3). Some cases require writing at least the 32-bit register for correctness, so a better example would be something like:

    char foo(char *p) { return *p; } compiles to movzx eax, byte ptr [rdi]
    instead of mov al, [rdi]. https://godbolt.org/z/4ca9cTG9j

    But GCC doesn't always avoid partial registers, sometimes even causing partial-register stalls https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533


    Writing partial registers entails a performance penalty on many x86 processors because they are renamed into different physical registers from their whole counterpart when written. (For more about register renaming enabling out-of-order execution, see this Q&A).

    But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)

    It's called a partial register stall. Agner Fog's microarchitecture manual explains it pretty well:

    6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)

    Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
    Example:

    ; Example 6.10a. Partial register stall
    mov al, byte ptr [mem8]
    mov ebx, eax ; Partial register stall
    

    This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.

    Behaviour in different CPUs:

    Without partial-register renaming, the input dependency for the write is a false dependency if you never read the full register. This limits instruction-level parallelism because reusing an 8 or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves). And also, it makes AL and AH not independent. When Intel designed P6-family (PPro released in 1993), 16-bit code was still common, so partial-register renaming was an important feature to make existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)

    That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.

    As @BeeOnRope's answer points out, compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)


    Also note that write takes arguments that, for an x86-64 typically configured GCC, need whole 32-bit and 64-bit registers so it couldn't simply be assembled into mov dl, 3. The size is determined by the type of the data, not the value of the data.

    Only 32-bit register writes implicitly zero-extend to the full 64-bit; writing 8 and 16-bit partial registers leave the upper bytes unchanged. (This makes it tricky for hardware to handle efficiently, which is why AMD64 didn't follow that pattern.)

    Finally, in certain contexts, C has default argument promotions to be aware of, though this is not the case.
    Actually, as RossRidge pointed out, the call was probably made without a visible prototype.


    Your disassembly is misleading, as @Jester pointed out.
    For example mov rdx, 3 is actually mov edx, 3, although both have the same effect—that is, to put 3 in the whole rdx.
    This is true because an immediate value of 3 doesn't require sign-extension and a MOV r32, imm32 implicitly clears the upper 32 bits of the register.