x86, arm, hardware, cpu-architecture, processor

Could a processor be made that supports multiple ISAs? (ex: ARM + x86)


Intel has been internally decoding CISC instructions to RISC-like instructions since their P6 (Pentium Pro) architecture, and AMD has been doing so since their K5 processors. So does this mean that the x86 instructions get translated to some weird internal RISC ISA during execution? If that is what is happening, then I wonder if it's possible to create a processor that understands (i.e., internally translates to its own proprietary instructions) both x86 and ARM instructions. If that is possible, what would the performance be like? And why hasn't it been done already?


Solution

  • The more different the ISAs, the harder it would be, and the more overhead it would cost, especially in the back-end. It's not as easy as slapping a different front-end onto a common back-end microarchitecture design.

    If it were just a die-area cost for different decoders, with no other power or performance cost, that would be minor and totally viable these days, with large transistor budgets. (Taking up space in a critical part of the chip, pushing important things farther from each other, is still a cost, but that's unlikely to be a problem in the front-end.) Clock or even power gating could fully power down whichever decoder wasn't being used. But as I said, it's not that simple, because the back-end has to be designed to support the ISA's instructions and other rules / features; CPUs don't decode to a fully generic / neutral RISC back-end. Related: Why does Intel hide internal RISC core in their processors? has some thoughts and info about what the internal RISC-like uops are like in modern Intel designs.

    Adding the capability to run ARM code to Skylake, for example, would make it slower and less power-efficient when running pure x86 code, as well as costing more die area. That's not worth it commercially, given the limited market for it, and the need for special OS or hypervisor software to even take advantage of it. (Although that might start to change with AArch64 becoming more relevant thanks to Apple.)

    A CPU that could run both ARM and x86 code would be significantly worse at either one than a pure design that only handles one.

    • efficiently running 32-bit ARM requires support for fully predicated execution, including fault suppression for loads / stores. (Unlike AArch64 or x86, which only have ALU-select type instructions like csinc vs. cmov / setcc that just have a normal data dependency on FLAGS as well as their other inputs.)

    • ARM and AArch64 (especially SIMD shuffles) have several instructions that produce 2 outputs, while almost all x86 instructions only write one output register. So x86 microarchitectures are built to track uops that read up to 3 inputs (2 before Haswell/Broadwell), and write only 1 output (or 1 reg + EFLAGS). (See the intrinsics sketch after this list.)

    • x86 requires tracking the separate components of a CISC instruction, e.g. the load and the ALU uops for a memory source operand, or the load, ALU, and store for a memory destination.

    • x86 requires coherent instruction caches, and snooping for stores that modify instructions already fetched and in flight in the pipeline, or some way to handle at least x86's strong self-modifying-code ISA guarantees (Observing stale instruction fetching on x86 with self-modifying code). The JIT sketch after this list shows the explicit cache maintenance ARM needs instead.

    • x86 requires a strongly-ordered memory model (program order + a store buffer with store-forwarding). You have to bake this into your load and store buffers, so I expect that even when running ARM code, such a CPU would basically still use x86's far stronger memory model. (Modern Intel CPUs speculatively load early and do a memory-order machine clear on mis-speculation, so maybe you could let that happen and simply not do those pipeline nukes. Except in cases where it was due to mis-predicting whether a load was reloading a recent store by this thread or not; that of course still has to be handled correctly.) See the litmus-test sketch after the Apple M1 notes below.

      A pure ARM could have simpler load / store buffers that didn't interact with each other as much. (Except for the purpose of making stlr / ldapr / ldar release / acquire / acquire-seq-cst cheaper, not just fully stalling.)

    • Different page-table formats. (You'd probably pick one or the other for the OS to use, and only support the other ISA for user-space under a native kernel.)

    • If you did try to fully handle privileged / kernel stuff from both ISAs, e.g. so you could have HW virtualization with VMs of either ISA, you also have stuff like control-register and debug facilities.
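
    To make a couple of the points above concrete, here is a minimal NEON-intrinsics sketch of a two-output instruction (the function name deinterleave is just for illustration). vld2q_u8 compiles to a single LD2 load that writes two vector registers, which a one-output uop format like x86's can't represent directly:

        #include <arm_neon.h>

        /* LD2: one instruction, two register outputs (de-interleaves 32 bytes). */
        uint8x16x2_t deinterleave(const uint8_t *p) {
            return vld2q_u8(p);
        }

    And a sketch of the instruction-cache point, the kind of thing a JIT has to do: on ARM / AArch64 the explicit cache-maintenance call is required before executing freshly-written code, while on x86 the coherent I-cache (plus snooping of in-flight code) means it compiles to nothing. Error handling is omitted for brevity:

        #include <stdint.h>
        #include <string.h>
        #include <sys/mman.h>

        typedef int (*jit_fn)(void);

        jit_fn install_code(const uint8_t *code, size_t len) {
            uint8_t *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            memcpy(buf, code, len);                     /* write instructions as data     */
            __builtin___clear_cache((char *)buf,        /* IC/DC maintenance on ARM;      */
                                    (char *)buf + len); /* compiles to nothing on x86     */
            return (jit_fn)buf;
        }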

    Update: Apple M1 does support a strong x86-style TSO memory model, allowing efficient+correct binary translation of x86-64 machine code into AArch64 machine code, without needing to use ldapr / stlr for every load and store. It also has a weak mode for running native AArch64 code, toggleable by the kernel.

    In Apple's Rosetta binary translation, software handles all the other issues I mentioned; the CPU is just executing native AArch64 machine code. (And Rosetta only handles user-space programs, so there's no need to even emulate x86 page-table formats and semantics like that.)
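
    To make the memory-model point concrete, here is a minimal C11 sketch of the message-passing pattern (the producer/consumer names are made up). The release/acquire pair is free on x86: plain mov stores and loads already give that ordering under TSO. On AArch64's normal weak memory model the compiler has to use stlr / ldar (or ldapr) for them, which is exactly why the M1's TSO mode lets Rosetta-translated code get away with plain str / ldr:

        #include <stdatomic.h>
        #include <pthread.h>
        #include <stdio.h>

        static atomic_int payload;
        static atomic_int flag;

        static void *producer(void *arg) {
            (void)arg;
            atomic_store_explicit(&payload, 42, memory_order_relaxed);
            atomic_store_explicit(&flag, 1, memory_order_release);  /* plain mov on x86; stlr on AArch64 */
            return NULL;
        }

        static void *consumer(void *arg) {
            (void)arg;
            while (atomic_load_explicit(&flag, memory_order_acquire) == 0)  /* plain mov on x86; ldar/ldapr on AArch64 */
                ;
            printf("payload = %d\n",
                   atomic_load_explicit(&payload, memory_order_relaxed));   /* always prints 42 */
            return NULL;
        }

        int main(void) {
            pthread_t t1, t2;
            pthread_create(&t1, NULL, producer, NULL);
            pthread_create(&t2, NULL, consumer, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return 0;
        }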


    This already exists for other combinations of ISAs, notably AArch64 + ARM, and also x86-64 + 32-bit x86, where the 64-bit ISA has a slightly different machine-code format and a larger register set. Those pairs of ISAs were of course designed to be compatible, and for kernels for the new ISA to have support for running the older ISA as user-space processes.

    At the easiest end of the spectrum, we have x86-64 CPUs which support running 32-bit x86 machine code (in "compat mode") under a 64-bit kernel. They use the same fetch/decode/issue/out-of-order-exec pipeline for all modes. 64-bit x86 machine code is intentionally similar enough to 16- and 32-bit modes that the same decoders can be used, with only a few mode-dependent decoding differences (like inc/dec vs. REX prefixes). AMD was intentionally very conservative, unfortunately, leaving many minor x86 warts unchanged for 64-bit mode to keep the decoders as similar as possible. (Perhaps in case AMD64 didn't catch on, so they wouldn't be stuck spending extra transistors on something people wouldn't use.)
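
    A small sketch of those mode-dependent differences: opcodes 0x40-0x4F are the one-byte inc/dec instructions in 32-bit mode but were repurposed as REX prefixes in 64-bit mode, so the same bytes decode differently depending on mode:

        /* The byte sequence 40 FF C0:
         *   32-bit mode:  40       inc eax        (one-byte inc)
         *                 FF C0    inc eax        (the 2-byte ModRM encoding)
         *   64-bit mode:  40 FF C0 inc eax        (0x40 is now an empty REX prefix)
         */
        static const unsigned char same_bytes[] = { 0x40, 0xFF, 0xC0 };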

    AArch64 and ARM 32-bit are separate machine-code formats with significant differences in encoding. e.g. immediate operands are encoded differently, and I assume most of the opcodes are different. Presumably pipelines have 2 separate decoder blocks, and the front-end routes the instruction stream through one or the other depending on mode. Both are relatively easy to decode, unlike x86, so this is presumably fine; neither block has to be huge to turn instructions into a consistent internal format. Supporting 32-bit ARM does mean somehow implementing efficient support for predication throughout the pipeline, though.
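
    A minimal C sketch of the kind of conditional store that ARM-style predication handles directly (the function and variable names are just illustrative):

        /* 32-bit ARM can predicate the store itself (e.g. cmp + strne),
         * suppressing both the store and any fault when cond is false.
         * x86 and AArch64 have no conditional store, and cmov / csel can't
         * suppress a fault from a bad address, so compilers branch instead. */
        void maybe_store(int cond, int *dst, int val) {
            if (cond)
                *dst = val;
        }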

    Early Itanium (IA-64) also had hardware support for x86, defining how the x86 register state mapped onto the IA-64 register state. Those ISAs are completely different. My understanding was that x86 support was more or less "bolted on", with a separate area of the chip dedicated to running x86 machine code. Performance was bad, worse than good software emulation, so once that emulation was ready, later HW designs dropped x86 support. (https://en.wikipedia.org/wiki/IA-64#Architectural_changes)

    So does this mean that the x86 instructions get translated to some weird internal RISC ISA during execution?

    Yes, but that "RISC ISA" is not similar to ARM. e.g. it has all the quirks of x86, like shifts leaving FLAGS unmodified if the shift count is 0. (Modern Intel handles that by decoding shl eax, cl to 3 uops; Nehalem and earlier stalled the front-end if a later instruction wanted to read FLAGS from a shift.)

    Probably a better example of a back-end quirk that needs to be supported is x86 partial registers, like writing AL and AH, then reading EAX. The RAT (register allocation table) in the back-end has to track all that, and issue merging uops or however it handles it. (See Why doesn't GCC use partial registers?).
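
    As an illustration of that pattern, here is a GNU C inline-asm sketch for x86-64 (GCC/Clang syntax; the function name is made up) that does two byte-sized writes and then a full-width read, forcing whatever merging mechanism the renamer uses:

        #include <stdio.h>

        static unsigned partial_regs(void) {
            unsigned result;
            __asm__ ("xorl  %%eax, %%eax\n\t"   /* EAX = 0 (zeroing idiom)            */
                     "movb  $0x34, %%al\n\t"    /* partial write: bits 0-7            */
                     "movb  $0x12, %%ah\n\t"    /* partial write: bits 8-15           */
                     "movl  %%eax, %0"          /* full-width read: needs AL/AH merge */
                     : "=r"(result)
                     :
                     : "eax", "cc");
            return result;
        }

        int main(void) {
            printf("0x%x\n", partial_regs());   /* prints 0x1234 */
            return 0;
        }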