Tags: x86, x86-64, cpu-architecture, cpu-registers

Why didn't the x86-32 architecture get more general purpose registers before x86-64?


The x86-32 architecture does have eight 32-bit general purpose registers: EAX, EBX, ECX, EDX, ESI, EDI, ESP and EBP.

But why did it take until the x86-64 architecture for this number to double?

If you look at the 32-bit CPUs from the i386 through the Pentium 4, and at the software world, it repeatedly happened during the 32-bit era that software dropped support for older 32-bit CPUs because it used some feature that only the newer CPUs had. So the question arises: why wasn't the number of general purpose registers simply increased earlier?

Could it be that Intel wanted to push the Itanium architecture and was therefore not interested in more GPRs for the x86 architecture? And if so, why didn't AMD try it on its own before introducing AMD64 in 2003?


Solution

  • It took a major ISA extension to add more bits for register numbers in instructions (via REX prefixes), and decoding was already a big problem for CPUs in that era: a throughput bottleneck and a big power consumer. Uop caches in the Sandybridge and Zen families helped a lot with that, but those didn't arrive until 2011. (P4's trace cache also helped when it worked, but it was small, and being a trace cache meant the same code could be stored redundantly in multiple traces.)
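
    For a concrete sketch of where the limit of 8 comes from: the ModRM byte has only two 3-bit register-number fields, so a fourth bit per operand has to come from somewhere else, which is exactly what REX provides. (Byte breakdowns below follow the standard encoding tables; NASM syntax.)

        01 D1          ; add ecx, edx   (32-bit mode)
                       ;   01 = ADD r/m32, r32
                       ;   D1 = ModRM 11 010 001: mod=11, reg=010 (edx), rm=001 (ecx)
                       ;   3 bits per register field => at most 8 registers

        49 01 C8       ; add r8, rcx    (64-bit mode)
                       ;   49 = REX 0100 W=1 R=0 X=0 B=1
                       ;   REX.B supplies the 4th bit of the r/m register number:
                       ;   1.000 = r8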

    32-bit mode was completely out of opcode coding space for new prefixes; VEX and EVEX only work in 32-bit mode as otherwise-invalid encodings of existing instructions (LES/LDS and BOUND with register operands, which those instructions don't allow), and that level of overloading is presumably harder for decoders.
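
    For instance, here's how a 2-byte VEX prefix hides inside an invalid LDS encoding (byte breakdown per the standard encoding tables):

        C5 F8 58 C1    ; vaddps xmm0, xmm0, xmm1
                       ;   C5 = LDS opcode in 32-bit mode, but LDS requires a
                       ;        memory operand
                       ;   F8 = 11 111 000: as a ModRM byte, mod=11 (register)
                       ;        is invalid for LDS, so the decoder treats
                       ;        C5 F8 as a VEX prefix instead
                       ;   58 = ADDPS opcode; C1 = ModRM: xmm0, xmm1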

    (But now that CPUs do support VEX and EVEX prefixes for AVX and AVX-512 respectively, those prefixes can be used for scalar stuff like BMI1/2. And in 64-bit mode, for the upcoming APX extension that doubles the number of GPRs to 32 with REX2 and EVEX prefixes, and gives 3-operand versions of existing integer instructions like xor.)
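
    BMI1's andn already shows the 3-operand, non-destructive style that VEX makes possible for integer code, and that APX generalizes to legacy instructions:

        andn eax, ebx, ecx     ; eax = ~ebx & ecx  (VEX-encoded, BMI1)
                               ; non-destructive: neither source is overwritten,
                               ; unlike the legacy 2-operand  and ebx, ecx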

    It wouldn't have been attractive enough for most OSes to want to support a new mode of execution, and more importantly a new ABI with another version of all the libraries, just for maybe a 15% speedup. Only something fundamental like more address space really justifies that much short-term effort on multi-arch systems.

    REX prefixes in AMD64 repurpose the 1-byte encodings of inc r32 / dec r32. Any 32-bit mode that did something similar couldn't run existing 32-bit code, i.e. new code compiled for that mode couldn't call existing libraries, unless it called into them through a call far wrapper that switched to a legacy code-segment. (Perhaps with some special return-address format so their plain near ret could return back to the mode where 0x4? bytes are REX prefixes instead of inc/dec instructions? Like the high bit being set in CPL=3 meaning it's a special return address?) Far call/ret are generally not fast, but perhaps some special magic for switching between modes could have been supported by the CPU, including the decoders.
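
    The same machine code already means different things in the two existing modes, which is exactly why a hypothetical REX-in-32-bit mode breaks binary compatibility:

        48 FF C0       ; 64-bit mode:  inc rax   (48 = REX.W, FF /0 = INC r/m64)
                       ; 32-bit mode:  dec eax   (48 = DEC EAX)
                       ;               inc eax   (FF C0 = INC r/m32)
                       ; identical bytes, two completely different meanings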

    So anyway, it would still have broken compat with existing binary libraries unless some pretty hacky tricks were used, so doing it without also widening registers to 64-bit would have been pretty silly in the era you're talking about. Intel's PPro in 1995 supported PAE for wider physical addresses without wider virtual addresses, because that only required a page-table format change: only the kernel needed to know about it. By the late 90s, 4 GiB of physical RAM was already a thing for servers, and even home desktops were creeping up towards it.

    Intel was adding SIMD extensions to x86 (SSE and SSE2), but not major new extensions that, like new GPRs, would have required a lot of toolchain and OS development work to use. As you say, they wanted OS and compiler devs spending their time on Itanium support, not on new modes for x86.

    AMD64 was a pretty conservative design that left a lot of x86 warts in place, e.g. setcc r/m8 still only writes the low byte of a register, so to get an int 0 or 1 you need to xor-zero the destination before the compare. The same goes for the weird partial-FLAGS semantics of rotates, and for shifts leaving FLAGS unmodified if the count happens to be 0. These choices minimized the amount of extra transistors in AMD K8 and later AMD CPUs that would be wasted in most home computers running 32-bit Windows, since AMD knew x86-64 wouldn't catch on quickly in binary ecosystems like Windows. A little extra short-term work for compiler devs would have made all x86-64 programs more efficient in the long run, but capitalism favours short-term considerations.
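
    In practice, the setcc wart means materializing a boolean costs an extra instruction, in the order described above (zero first, because xor clobbers FLAGS):

        xor  eax, eax      ; zero the full register first (must come before the
                           ; cmp, since xor clobbers FLAGS)
        cmp  edi, esi
        sete al            ; setcc writes only the low byte; the upper 24 bits
                           ; are already zero thanks to the xor
        ; eax = (edi == esi) ? 1 : 0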

    (Intel has been equally guilty of short-sighted design choices, like SSE2's cvtsi2sd xmm0, eax having an output dependency because it merges a new low half into the 16-byte XMM0. That's how SSE1 did it for cvtsi2ss, because PIII only had 64-bit-wide SIMD execution units, so it saved a uop to not have to write the high half as zero; again a short-term consideration. The SSE2 version is just being consistent, to the detriment of out-of-order exec. But SSE2 was new in P4, which has 128-bit-wide vector execution units; writing the upper half as zero would have cost nothing extra and avoided the output dependency.)
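
    This is why compilers still emit a dependency-breaking zeroing instruction before the conversion:

        ; false dependency: cvtsi2sd merges into the old xmm0, so it can't
        ; start executing until whatever last wrote xmm0 completes
        cvtsi2sd xmm0, eax

        ; what compilers actually emit to break the dependency chain:
        pxor     xmm0, xmm0    ; zeroing idiom, recognized as dependency-breaking
        cvtsi2sd xmm0, eax     ; now independent of xmm0's previous value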