x86-64 intel endianness cpu-word machine-code

Are machine code instructions fetched in little endian 4-byte words on an Intel x86-64 architecture?

Despite a common definition for word (as stated on Wikipedia) being:

The largest possible address size, used to designate a location in memory, is typically a hardware word (here, "hardware word" means the full-sized natural word of the processor, as opposed to any other definition used).

For x86 systems, according to some sources, it's treated as 16 bits:

In the x86 PC (Intel, AMD, etc.), although the architecture has long supported 32-bit and 64-bit registers, its native word size stems back to its 16-bit origins, and a "single" word is 16 bits. A "double" word is 32 bits. See 32-bit computer and 64-bit computer.

Yet Intel's official documentation (SDM Vol. 2, Section 1.3.1) states:

this means the bytes of a word are numbered starting from the least significant byte. Figure 1-1 illustrates these conventions.

and Figure 1-1 shows 4 bytes in little-endian sequence, not the 2 or 8 bytes that the varying definitions of "word" in the sources above would suggest for x86-64.

What really confuses me is how instructions are fetched and parsed. I'm writing an emulator, and once I parse a PE-formatted executable and get to the text section, if I'm to follow the 4-byte little-endian format, doesn't that mean the 4th byte would be parsed first?

Let's make up some bytes for example:

.text segment buffer:
< 0x10, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20 > ....

Would I parse the first instruction as 1C, 1B, 1A, 10, 20, 1F, 1E, 1D ... (and so on, being variable length there's obviously potentially more words to read depending on what the real bytes are here)?

Solution

No, x86 machine code is a byte stream; there's nothing word-oriented about it, except that multi-byte displacements and immediates (such as the 32-bit ones below) are stored little-endian. It's physically fetched in 16-byte or 32-byte chunks on modern CPUs, which are split on instruction boundaries in parallel to feed multiple decoders. For example, add qword [rdi + 0x1234], 0xaabbccdd encodes as:

48      81    87      34 12 00 00    dd cc bb aa
REX.W   add   ModRM   le32 0x1234    le32 0xaabbccdd (sign-extended to 64-bit)

   add    QWORD PTR [rdi+0x1234],0xffffffffaabbccdd
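For the emulator case, here is a minimal Python sketch of reading that one encoding from a byte buffer. The function name decode_add_m64_imm32 is made up for illustration, and it handles only this single form (REX.W, opcode 81 /0, mod=10 with a disp32, no SIB byte); a real decoder has to cover prefixes, SIB, and every other ModRM mode. The point is only that the bytes are consumed in address order, and just the displacement and immediate fields are interpreted as little-endian integers.

```python
import struct


def decode_add_m64_imm32(code: bytes, offset: int = 0):
    """Decode only 'REX.W 81 /0' with mod=10 (disp32) and no SIB byte.

    Hypothetical helper for illustration, not a general x86-64 decoder.
    Returns (base register number, displacement, 64-bit immediate, length).
    """
    pos = offset
    # Prefix, opcode and ModRM are single bytes, read in address order.
    rex = code[pos]
    opcode = code[pos + 1]
    modrm = code[pos + 2]
    pos += 3
    if rex != 0x48 or opcode != 0x81:
        raise ValueError("not the encoding this sketch handles")

    mod = modrm >> 6
    reg = (modrm >> 3) & 7   # opcode extension: /0 selects ADD
    rm = modrm & 7           # base register: 7 is rdi
    if mod != 0b10 or reg != 0 or rm == 0b100:
        raise ValueError("not the encoding this sketch handles")

    # "<i" is a little-endian signed 32-bit integer; no alignment is needed.
    (disp,) = struct.unpack_from("<i", code, pos)
    pos += 4
    (imm,) = struct.unpack_from("<i", code, pos)
    pos += 4

    # The imm32 is sign-extended to the 64-bit operand size.
    imm64 = imm & 0xFFFFFFFFFFFFFFFF
    return rm, disp, imm64, pos - offset


code = bytes.fromhex("48 81 87 34 12 00 00 dd cc bb aa")
rm, disp, imm64, length = decode_add_m64_imm32(code)
print(rm, hex(disp), hex(imm64), length)
```

The emulator's instruction pointer then advances by the returned length, and the next instruction starts at whatever byte follows, aligned or not.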

x86-64 is not a word-oriented architecture; there is no single natural word-size, and things don't have to be aligned. That concept is not very useful when thinking about x86-64. The integer register width happens to be 8 bytes, but that's not even the default operand-size in machine code, and you can use any operand-size from byte to qword with most instructions, and for SIMD from 8 or 16 byte up to 32 or 64 byte. And most importantly, alignment of wider integers isn't required in machine code, or even for data.
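To illustrate the operand-size point, here is a small Python sketch for one opcode pair in 64-bit mode: ADD r/m, r (opcode 00 for byte operands, 01 for everything else). The helper name operand_size_bits is hypothetical, and it looks only at an optional 66 prefix and an optional REX prefix, ignoring all other prefixes.

```python
def operand_size_bits(insn: bytes) -> int:
    """Operand size of ADD r/m, r (opcodes 00 and 01) in 64-bit mode.

    Hypothetical helper for illustration: handles only an optional 66
    operand-size prefix followed by an optional REX prefix.
    """
    pos = 0
    has_66 = False
    rex_w = False
    if insn[pos] == 0x66:            # operand-size override prefix
        has_66 = True
        pos += 1
    if 0x40 <= insn[pos] <= 0x4F:    # REX prefix; bit 3 is REX.W
        rex_w = bool(insn[pos] & 0x08)
        pos += 1

    opcode = insn[pos]
    if opcode == 0x00:
        return 8                     # byte form of ADD
    if opcode != 0x01:
        raise ValueError("not an ADD r/m, r encoding")
    if rex_w:
        return 64                    # REX.W wins over a 66 prefix
    if has_66:
        return 16
    return 32                        # default operand size in 64-bit mode


for text in ("00 c8", "66 01 c8", "01 c8", "48 01 c8"):
    print(text, "->", operand_size_bits(bytes.fromhex(text)), "bits")
```

These four encodings are add al, cl; add ax, cx; add eax, ecx; and add rax, rcx. The shortest non-byte form is the 32-bit one, which is what "not even the default operand-size" refers to.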

Some people like to fit a square peg into a round hole and describe x86 in terms of machine-words, but that concept only really fits well for RISC ISAs that are designed around a single word size. (Fixed instruction length, register size, and even data memory load/store is required to be word aligned for word-sized accesses on some RISCs, although modern ones often allow unaligned load/store with some performance penalty.)

(To be fair, 64-bit RISCs are usually also equally efficient with 32 and 64-bit integers. But unlike x86 they can't do add ax, cx, which avoids propagating carry into the higher bits of a register, although RISCs can do a 16-bit store after some math on sign-extending or zero-extending load results.)
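As a rough model of what "avoids propagating carry" means for an emulator, here is a Python sketch that treats registers as plain 64-bit integers. The function names add16 and add64 are made up for illustration, and flags are not modeled.

```python
MASK16 = 0xFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add16(dst: int, src: int) -> int:
    """Model add ax, cx on full 64-bit register values.

    Only bits 0..15 of dst change; a carry out of bit 15 is dropped
    (on real hardware it would go to CF, which this sketch ignores).
    """
    low = (dst + src) & MASK16
    return (dst & ~MASK16 & MASK64) | low


def add64(dst: int, src: int) -> int:
    """Model add rax, rcx: the carry ripples through all 64 bits."""
    return (dst + src) & MASK64


rax = 0x111122223333FFFF
rcx = 1
print(hex(add16(rax, rcx)))  # low 16 bits wrap to 0000, upper bits untouched
print(hex(add64(rax, rcx)))  # carry propagates into bit 16
```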

Related: Are there any modern CPUs where a cached byte store is actually slower than a word store? x86 byte and unaligned word/dword stores are more efficient than on many RISCs.

according to some sources, note it's treated as 16 bits:

Yes, in x86 terminology and documentation, a "word" is 16 bits, because modern x86-64 evolved out of 8086, and it would have been silly to change the meaning of a term that everyone had been using in the documentation for years when the 386 was released. Hence paddw (packed add of 16-bit SIMD elements) and the movsw/stosw/etc. string instructions.

An x86 16-bit "word" has absolutely zero connection to the concept of a "machine word" in CPU architecture.

On 8086 through 286, 16-bit was the register and bus width, and the only integer operand-size other than byte you could use for most ALU instructions. But those CPUs were still very much not based around "words" the way MIPS is; the machine-code format was still the same, with unaligned little-endian 16-bit immediates and displacements. (8088 was identical to 8086, except for the 8-bit bus interface and a 4-byte instruction prefetch buffer instead of 6-byte.)