Writes to 8 and 16 bit registers are usually avoided in modern x86 assembly because they're specified to leave the upper 56/48 bits of the full register unchanged. This creates performance problems in the form of partial-register stalls, because the old upper bits now need to be merged with the newly written low bits to produce the destination register's value. When AMD64 was created and the registers were extended to 64 bits, it was instead specified that 32-bit operations fill the upper 32 bits of the register with zeros, an obvious benefit because it avoids partial-register stalls for 32-bit arithmetic.
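To make the difference concrete, here's a minimal NASM-style sketch of the architecturally specified results (the values are arbitrary, chosen only for illustration):

```
bits 64
mov rax, 0x1122334455667788
mov ax, 0x9999      ; 16-bit write merges:       RAX = 0x1122334455669999
mov al, 0x77        ; 8-bit write merges:        RAX = 0x1122334455669977
mov eax, 0x9999     ; 32-bit write zero-extends: RAX = 0x0000000000009999
```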
There are a few questions asking why the 32 bit behaviour exists, but I've not been able to find any answer or even speculation about why the 8/16 bit behaviour exists.
I think I understand why the behaviour exists for 8-bit registers: the original 8086 was intended to be (mostly) source compatible with the 8080, which had an 8-bit ALU but used 16-bit addresses formed from pairs of 8-bit registers. Therefore the 8-bit register merging would be required to create an instruction-to-instruction mapping from the 8080.
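As a sketch of how such a translation relies on merging (using the register mapping usually described for Intel's 8080-to-8086 conversion, A→AL, H/L→BH/BL, and the 8080 memory operand M→byte at [BX]; treat the exact mapping here as illustrative):

```
bits 16
; Mechanical 8080 -> 8086 translation sketch
mov bl, 0x34        ; 8080: MVI L, 34h  -- must not disturb BH (the 8080 "H")
mov bh, 0x12        ; 8080: MVI H, 12h  -- must not disturb BL
mov al, [bx]        ; 8080: MOV A, M    -- M is the byte at HL, so BX must now
                    ;                      hold 0x1234 for the address to be right
```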
In addition to the 8/16 bit registers, this behaviour is also seen with non-VEX-encoded SSE instructions. SSE introduced vector registers that were 128 bits wide. AVX extended these to 256 bits, and it was decided that legacy SSE instructions writing the lower 128 bits should preserve the upper half of the register, while 128-bit instructions using the new VEX prefix zero the upper half.
I think the reason for the SSE register behaviour was ABI concerns: one might want to call a function compiled before AVX existed which needs to save/restore the values in the vector registers, but because it only uses SSE instructions it can only save the lower 128 bits. Therefore the hardware automatically preserves the upper 128 bits when instructions are used that don't indicate knowledge of them. Another way to solve this issue (without any special hardware treatment) would have been to change the ABI for AVX-aware programs to treat the upper halves of AVX registers as caller-saved, but there may be reasons why that wasn't done.
I don't believe the backwards-compatibility and ABI arguments work for 16-bit registers, however, and I'm struggling to find an alternative justification for it.
As I understand it, the 386 was the first x86 processor that extended the registers to 32 bits. The opcodes for 16-bit operations were reused to be 32 bits by default, with the operand-size override prefix needed to get 16-bit registers again. Therefore it was entirely feasible for the architecture to make 16-bit operations clear the upper half of the register, rather than the merging behaviour that we got. There was no ABI concern because any 16-bit save/restore instructions would automatically save the full 32-bit registers, and there were no backwards-compatibility issues because none of the 16-bit programs would've had any idea that the registers were wider than 16 bits, unlike the 8080 programs which knew about the 8-bit register pairing. Basically, I can't seem to think of a reason why you would want the merging behaviour in preference to the clearing behaviour.
I understand that the rationale for having the 16 bit registers maintain the upper bits of the register may not have ever been explained but I've been wondering about this for a while and thought it worth asking in case anyone knows or at the very least has a good guess.
First of all, 386's choice of leaving the upper 16 bits unmodified is consistent with the behaviour when writing an 8-bit register. If out-of-order exec with register renaming still wasn't even on the radar for the architects of 386, it would be the obvious choice.
> [...] the original 8086 was intended to be (mostly) source compatible with the 8080 [...]
Correct; the choice to map pairs of 8-bit 8080 regs to high/low halves of 16-bit 8086 regs pretty much required leaving the other half unmodified for simple automated translation schemes that don't need to consider context. And as you say, the fact that 8080 used pairs of 8-bit regs as addresses made this choice easier than mapping every 8080 register to the bottom of a different 16-bit register. (That would need 8086 to have addressing modes that merge two regs, or something)
> The opcodes for 16 bit operations were reused to be 32 bits by default, with the operand size override prefix needed to have 16 bit registers again.
You're describing 32-bit protected mode. Perhaps you're forgetting that, unlike AMD64, new 386 features including the wider registers are available via prefixes in 16-bit modes.
When the default operand size is 16, an operand-size prefix makes it 32. (Same for address-size.)
For example, `add eax, ecx` assembles to `01 C8` in 32 or 64-bit mode, but to `66 01 C8` for 16-bit mode. Vice-versa for `add ax, cx`. (Unlike memory addressing modes, register-direct uses the same numbers in 16 and 32-bit ModRM encodings.)
When the 386 was new and Intel marketing cared most about selling it, most 386 CPUs would spend most of their time in 16-bit real mode running DOS programs (because DOS PCs were a lot of the market), or running in 16-bit protected mode.
One use-case enabled by the actual 386 behaviour is programs using 32-bit registers under a single-tasking 16-bit-only OS (specifically DOS).
To context-switch, you'd need the OS to know about the full architectural state and save/restore it. But if you're not doing that, interrupt handlers (including system calls) only have to save/restore what they will modify themselves. If the OS code is purely 16-bit (e.g. written before 386 existed), then saving just the 16-bit low halves will Just Work even if user-space uses 32-bit registers. (I'm not talking about user-space switching to 32-bit mode and installing its own interrupt handlers, just using 32-bit regs in some functions.)
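A hypothetical sketch of such a handler, purely for illustration:

```
bits 16
; Pre-386-style interrupt handler (hypothetical): it only knows about 16-bit
; registers, so it only saves the 16-bit halves it is about to clobber.
old_isr:
    push ax
    push dx
    mov  dx, 0x3F8          ; e.g. poll a UART data port
    in   al, dx
    pop  dx
    pop  ax                 ; 16-bit writes: with merging, the upper halves of
    iret                    ; EAX/EDX that 32-bit user code was using survive.
                            ; If 16-bit writes zeroed the upper halves instead,
                            ; this old handler would silently corrupt that state.
```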
Even if you do have an updated (MS-)DOS, you might still have some old device drivers with their own interrupt handlers. Since they handle hardware interrupts, they can run between any two instructions, but they'll only save/restore the 16-bit (low halves of) registers. Combining such a device driver (or other TSR: terminate-and-stay-resident program) with "user-space" code that occasionally used 32-bit registers would be a recipe for mysterious crashes, and would go against the main selling point of the PC ecosystem: that old software continues to work on new hardware.
> There was no ABI concern because any 16 bit save/restore instructions would automatically save the full 32 bit registers, and there was no backwards compatibility issues because none of the 16 bit programs would've had any idea that the registers were wider than 16 bits
Changing something like `push bp` (`55`) to decode as `push ebp` even in 16-bit mode would be ABI-breaking. For example, a function that does `push bp` / `mov bp, sp` expects to find its first stack arg at `[bp+4]`, not `[bp+6]` (or `[bp+8]` with 32-bit return addresses as well). And code that did `push` / `push` / `call` / `add sp, 4` would unbalance the stack.
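To make those offsets concrete, here's a sketch of a typical 16-bit callee under the assumptions noted in the comments:

```
bits 16
; Typical 16-bit callee (real mode, near call, args pushed by the caller):
; the ABI bakes in that "push bp" pushes exactly 2 bytes and the near return
; address is 2 bytes.
add16:
    push bp
    mov  bp, sp
    mov  ax, [bp+4]     ; first arg: [bp+0] = saved BP, [bp+2] = return address
    add  ax, [bp+6]     ; second arg
    pop  bp
    ret
; If "push bp" silently pushed 4 bytes, the first arg would move to [bp+6]
; (or [bp+8] with a 32-bit return address too) and every existing binary breaks.
```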
I can't think of a plausible way to have context-dependent behaviour that would work around this; the CPU doesn't know whether it's "in an interrupt handler" or not, and interrupt handlers use push/pop both for saving/restoring user-space state and for their own internal function calls. (Fun fact: the PIC / APIC does know for external interrupts.)
You also mention AVX YMM registers. In both mainstream calling conventions (x86-64 SysV and Win x64), YMM uppers are call-clobbered, so callees can use AVX and then do a `vzeroupper` before calling functions that might use legacy-SSE.
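A minimal sketch of that pattern (NASM syntax; `maybe_sse_func` is a hypothetical external function that may contain only legacy-SSE code):

```
bits 64
extern maybe_sse_func

my_avx_func:
    vmovups ymm0, [rdi]         ; 256-bit work dirties the YMM upper halves
    vaddps  ymm0, ymm0, [rsi]
    vmovups [rdi], ymm0
    vzeroupper                  ; YMM uppers are call-clobbered anyway, so zero
    call    maybe_sse_func      ; them to avoid SSE/AVX transition or
    ret                         ; false-dependency penalties in the callee
```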
The legacy-SSE behaviour of preserving YMM uppers is due to binary-only Windows kernel drivers that manually save/restore a couple of XMM registers, rather than using the kernel's "save FP/SIMD state" function, which would be updated to be AVX-aware (e.g. Linux's `kernel_fpu_begin()`).
If writing an XMM reg with a legacy-SSE instruction had zero-extended like it does for AVX (VEX) encodings, this would destroy user-space state in interrupt handlers, again in a mysterious and hard-to-debug way. Same as if legacy DOS drivers zeroed the upper halves of 32-bit integer regs in interrupt handlers.
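A hypothetical sketch of the kind of driver code in question (not any real driver):

```
bits 64
; Pre-AVX kernel-driver interrupt path that manually saves just the two XMM
; registers it uses instead of calling the OS's "save FP/SIMD state" routine.
driver_isr:
    sub    rsp, 32
    movups [rsp],      xmm0     ; saves only the low 128 bits of YMM0/YMM1
    movups [rsp + 16], xmm1
    xorps  xmm0, xmm0           ; ...some legacy-SSE work...
    addps  xmm0, xmm1
    movups xmm0, [rsp]          ; legacy-SSE writes: with the real behaviour the
    movups xmm1, [rsp + 16]     ; interrupted code's YMM upper halves survive; if
    add    rsp, 32              ; these zero-extended like VEX encodings do, that
    iretq                       ; user-space state would be silently destroyed
```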
So basically we can thank Microsoft and their binary-only Windows software ecosystem (including device drivers) for the existence of `vzeroupper` and the performance problems that can happen if you don't use it.
And this is why no similar nonsense was needed for mixing AVX and AVX-512: AVX already defined VEX encodings as zero-extending out to any future wider width, so it isn't safe to manually save/restore a couple of YMM regs in a device driver this way, and CPU architects can assume there aren't binaries doing that. (What is the penalty of mixing EVEX and VEX encoded scheme? None.)
Fun fact: the lack of a future-proof way to save/restore the full width of a vector reg is why x86-64 System V made the unfortunate decision that all the XMM registers would be call-clobbered. They didn't consider, or didn't like, the possibility of making just the XMM part (or just the low 64 bits for scalar `double`) call-preserved, with any higher parts call-clobbered (which is how Win x64 works for YMM/ZMM6-15).
At least 2 or 3 call-preserved vector registers would have been quite helpful in code that makes math library function calls. (But not most of them like Win x64; that's far too many: often when you want vector regs, you want a lot of them in a loop that doesn't make any calls.)