assembly x86-16 cpu-architecture processor

Loading program from RAM in 8086

The 8086 uses 16-bit instructions, but each RAM address only holds 8 bits. How does the CPU load programs from RAM, then? Does it load one address, check whether the instruction needs 1/2/3 bytes (e.g. moving an 8/16-bit immediate to a register), and then execute the operation? Or have I got it wrong, and one RAM 'space' is actually 16 bits wide?

Solution

Many instructions are multi-byte, and yes that means they span two or more addresses.

The 8086's data bus is 16 bits wide, so it can load 16 bits (two adjacent byte addresses) in a single bus operation. You're confusing byte-addressable memory with bus width: each address still names one byte, but the CPU can transfer two of them at once.
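To make the distinction concrete, here is a minimal Python sketch that models RAM as a flat array of bytes. The names `ram`, `read_byte`, `read_word` and `bus_cycles_for_word` are made up for illustration; the only hardware facts assumed are that x86 is little-endian and that the 8086 needs two bus cycles for a word at an odd address.

```python
# Hypothetical model: RAM as a flat array of bytes, one byte per address.
ram = bytearray(16)
# Machine code for: mov word [0x1234], 0x5678
ram[4:10] = bytes([0xC7, 0x06, 0x34, 0x12, 0x78, 0x56])

def read_byte(addr):
    """One address names exactly one byte."""
    return ram[addr]

def read_word(addr):
    """A 16-bit access covers addr and addr+1, low byte first (little-endian)."""
    return ram[addr] | (ram[addr + 1] << 8)

def bus_cycles_for_word(addr):
    """An even-aligned word takes one bus cycle on 8086, an odd one takes two."""
    return 1 if addr % 2 == 0 else 2

print(hex(read_byte(6)))                                # 0x34
print(hex(read_word(6)))                                # 0x1234
print(bus_cycles_for_word(6), bus_cycles_for_word(7))   # 1 2
```

The same six bytes can be viewed as six separate addresses or as three 16-bit words; the addressing granularity does not change, only how many bytes move per transfer.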

Quoting the question: "Does it load one address and then checks if the instruction needs 1/2/3 bytes (e.g. moving a immediate to a register 8/16 bit)"

It continually fetches instruction bytes into a 6-byte prefetch buffer (2 bytes at a time, because it's a 16-bit CPU with a 16-bit data bus) whenever the bus isn't busy with data accesses triggered by the running instruction.
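The fetch side of that description can be sketched as a toy simulation. This is not a cycle-accurate model: `prefetch` and its `bus_busy` list are hypothetical, each list entry stands for one bus slot, the start address is assumed to be even, and nothing consumes bytes from the queue. The sketch also assumes that a fetch happens only when at least two bytes of the queue are free.

```python
from collections import deque

QUEUE_SIZE = 6      # 8086 prefetch queue capacity in bytes
FETCH_WIDTH = 2     # one 16-bit bus cycle brings in two bytes

def prefetch(memory, start, bus_busy):
    """Simulate the fetch side only; nothing consumes the queue here.

    bus_busy is a hypothetical per-slot list: True means the bus is taken
    by a data access of the running instruction, so no code fetch happens.
    Returns the queue contents after each slot.
    """
    queue = deque()
    fetch_addr = start
    history = []
    for busy in bus_busy:
        has_room = QUEUE_SIZE - len(queue) >= FETCH_WIDTH
        if not busy and has_room:
            queue.extend(memory[fetch_addr:fetch_addr + FETCH_WIDTH])
            fetch_addr += FETCH_WIDTH
        history.append(list(queue))
    return history

code = bytes([0xC7, 0x06, 0x34, 0x12, 0x78, 0x56, 0x50, 0x90])
slots = [False, True, False, False, False]
for step, contents in enumerate(prefetch(code, 0, slots)):
    print(step, [f"{b:02X}" for b in contents])
```

In this run the queue stalls in slot 1 because the bus is busy, and again in slot 4 because the queue is already full.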

The buffer is large enough to hold the longest 8086 instruction¹ (excluding prefixes, which are handled one per clock cycle before the CPU gets to the opcode). When it finishes executing the previous instruction, it looks at the buffer. See the link below for a better description, but it probably tries to decode the buffer as a whole instruction, or at least to find an opcode, and otherwise waits for the next fetch to try again. (I'm not sure how much it can pipeline fetching of later bytes for longer instructions, i.e. whether it can start executing while that happens.)

Note 1: The 8088, with its 8-bit bus, shrinks the prefetch buffer to 4 bytes (see this retrocomputing Q&A). Apparently the 8088 has the same transistor layout as the 8086 except for the Bus Interface Unit (BIU). So the 8088, and therefore the 8086, must not depend on holding a whole instruction in the prefetch buffer, because the 8088 can execute mov word [0x1234], 0x5678 (6 bytes: opcode + modrm + disp16 + imm16). The opcode + modrm is only 2 bytes; the remaining bytes are a disp8 or disp16 for the addressing mode and/or an imm8 or imm16 immediate, so presumably those can be fetched and decoded later.
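The byte count in that note follows from the opcode and ModRM byte alone, which is why the first two bytes are enough to know how many more are coming. A minimal sketch in Python, covering only a handful of opcodes; `disp_size` and `instruction_length` are illustrative names, not a description of how the hardware decoder is built:

```python
def disp_size(modrm):
    """Displacement bytes implied by a 16-bit-addressing ModRM byte."""
    mod = modrm >> 6
    rm = modrm & 0b111
    if mod == 0b00:
        return 2 if rm == 0b110 else 0   # rm=110 means [disp16], a direct address
    if mod == 0b01:
        return 1                         # disp8
    if mod == 0b10:
        return 2                         # disp16
    return 0                             # mod == 11: register operand, no disp

def instruction_length(code):
    """Length of the instruction at the start of code (tiny opcode subset)."""
    opcode = code[0]
    if 0x50 <= opcode <= 0x57:           # push r16
        return 1
    if 0xB0 <= opcode <= 0xB7:           # mov r8, imm8
        return 2
    if 0xB8 <= opcode <= 0xBF:           # mov r16, imm16
        return 3
    if opcode in (0xC6, 0xC7):           # mov r/m8, imm8 / mov r/m16, imm16
        imm = 1 if opcode == 0xC6 else 2
        return 2 + disp_size(code[1]) + imm
    raise ValueError(f"opcode {opcode:#04x} not covered by this sketch")

# mov word [0x1234], 0x5678  ->  C7 06 34 12 78 56
print(instruction_length(bytes([0xC7, 0x06, 0x34, 0x12, 0x78, 0x56])))  # 6
print(instruction_length(bytes([0xB8, 0x34, 0x12])))                    # 3
print(instruction_length(bytes([0xB0, 0x12])))                          # 2
```

Here the ModRM byte 0x06 has mod=00 and rm=110, which selects a bare 16-bit address, so the total is 1 (opcode) + 1 (ModRM) + 2 (disp16) + 2 (imm16) = 6 bytes.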

This 8086 gate-level reverse-engineering article, Latches inside: Reverse-engineering the Intel 8086's instruction register, says the 8086's actual instruction register is 1 byte, holding the opcode of the currently-executing instruction. (It wasn't until later CPUs that any 0F xx 2-byte opcodes were introduced).

See also: 8086 CPU architecture, which was the first hit for "8086 code fetch". It confirms that fetch and execute do overlap, so it's pipelined in the most basic way.

TL;DR: It fetches into a buffer until it has a whole instruction to decode. Then it shifts any extra bytes to the front of the buffer, because they're part of the next instruction.

I've read that usually instruction-fetch is the bottleneck for 8086, so optimizing for code-size outweighed pretty much everything else.

A pipelined CPU wouldn't have to wait for execution of the previous instruction to finish to get started on decoding. Modern CPUs also have much higher bandwidth code-fetch, so they have a queue of decoded instructions ready to go (except when branches mess this up.) See http://agner.org/optimize/, and other links in the x86 tag wiki.

Also, some very common instructions are a single byte, like push r16.
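As an illustration of how compact these encodings are, push r16 is the byte 0x50 plus a 3-bit register number, and pop r16 is 0x58 plus the same number. A small sketch, where the helper names `encode_push` and `encode_pop` are made up for this example:

```python
# Register numbers stored in the low 3 bits of the opcode byte.
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def encode_push(reg):
    """push r16 is the single byte 0x50 + register number."""
    return bytes([0x50 + REG16[reg]])

def encode_pop(reg):
    """pop r16 is the single byte 0x58 + register number."""
    return bytes([0x58 + REG16[reg]])

print(encode_push("ax").hex())   # 50
print(encode_push("bx").hex())   # 53
print(encode_pop("bx").hex())    # 5b
```

With a 2-byte fetch, one bus cycle can bring in two such instructions at once.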