assembly x86 machine-code opcode instruction-encoding

Why doesn't my assembler use the 05 opcode (add eax,imm32) short form the manual documents for ADD EAX,1 but it does for 04 ADD AL, 1?

I am writing an x86-64 assembler. I was looking through the Intel x86 manual volume 2, trying to understand how to generate the correct instructions from the assembly. I mostly understand how it works but have been assembling and disassembling instructions to check if I have it correct.

In the ADD reference table (Vol 2A, 3.31):

opcode        | Instruction  
04 ib         | ADD AL, imm8  
05 iw         | ADD AX, imm16  
05 id         | ADD EAX, imm32  
REX.W + 05 id | ADD RAX, imm32

Assemble:

;add.s   
add al, 1
add ax, 1
add eax, 1
add rax, 1

Disassemble:

.text:
   0:   04 01           add al, 1
   2:   66 83 c0 01     add ax, 1
   6:   83 c0 01        add eax, 1
   9:   48 83 c0 01     add rax, 1

So the first one matches the manual, but for the others the assembler uses encodings from further down the ADD reference table, in one case with a REX prefix. Why does it use those rather than the ones I listed above?

Now the second one, ADD AX, 1: after searching I found out that 66 is the operand-size override prefix, but it is not listed in the ADD reference table. When do I add this prefix? I cannot find much information on it or the other legacy prefixes in the Intel manual.

I tried to disassemble 05 01 as shown in the manual but it didn't recognise it as an opcode just numbers. The Intel manual is a good resource I think it just lacks some extra explanation and structure, still trying to wrap my head around the ModRM stuff as well.

Solution

There are multiple opcodes for adding an immediate to a 64-bit register:

Opcode	Instruction	Description
`REX.W + 05 id`	`ADD RAX, imm32`	Add imm32 sign-extended to 64-bits to RAX.
`REX.W + 81 /0 id`	`ADD r/m64, imm32`	Add imm32 sign-extended to 64-bits to r/m64.
`REX.W + 83 /0 ib`	`ADD r/m64, imm8`	Add sign-extended imm8 to r/m64.

Because 1 fits in a sign-extended byte, your assembler uses opcode 83 to keep the instruction short. If you try add rax, 100000000 or any other value that does not fit in 8 bits, you'll get opcode 05 (still with the REX.W prefix).
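As a rough illustration of that choice, here is a minimal Python sketch of how an assembler might pick between the two forms for add rax, imm. The helper names encode_add_rax_imm and modrm are made up for this example and are not taken from any real assembler. It also shows where the c0 byte comes from: it is the ModRM byte with mod=11 (register direct), the /0 opcode extension in the reg field, and rm=000 for RAX.

```python
import struct

REX_W = 0x48  # 0100 1000: REX prefix with W=1 (64-bit operand size)

def modrm(mod, reg, rm):
    """Pack the ModRM byte: mod (2 bits), reg or /digit (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

def encode_add_rax_imm(imm):
    """Hypothetical helper: shortest encoding of `add rax, imm` in 64-bit mode."""
    if -128 <= imm <= 127:
        # REX.W + 83 /0 ib: /0 selects ADD, rm=0 is RAX, imm8 is sign-extended
        return bytes([REX_W, 0x83, modrm(0b11, 0, 0)]) + struct.pack("<b", imm)
    if -2**31 <= imm <= 2**31 - 1:
        # REX.W + 05 id: accumulator short form, no ModRM byte needed
        return bytes([REX_W, 0x05]) + struct.pack("<i", imm)
    raise ValueError("ADD has no imm64 form")

print(encode_add_rax_imm(1).hex(" "))          # 48 83 c0 01
print(encode_add_rax_imm(100000000).hex(" "))  # 48 05 00 e1 f5 05
```

The 81 /0 id form is never chosen here because, for RAX, the 05 form encodes the same immediate in one byte less.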

To force a different encoding instead of the shortest one, you'll need to define some syntax for it in your assembler. NASM, for example, uses the strict keyword (shown here with mov):

mov    eax, 1                ; 5 bytes to encode (B8 imm32)
mov    rax, strict dword 1   ; 7 bytes: REX.W + C7 /0 id (mov r/m64, sign-extended imm32). NASM optimizes plain mov rax, 1 to the 5-byte form; dword or strict dword prevents that
mov    rax, strict qword 1   ; 10 bytes

Now if you look at the table closely you may see something "strange"

Opcode	Instruction	Description
`05 iw`	`ADD AX, imm16`	Add imm16 to AX.
`05 id`	`ADD EAX, imm32`	Add imm32 to EAX.
`81 /0 iw`	`ADD r/m16, imm16`	Add imm16 to r/m16.
`81 /0 id`	`ADD r/m32, imm32`	Add imm32 to r/m32.
`01 /r`	`ADD r/m16, r16`	Add r16 to r/m16.
`01 /r`	`ADD r/m32, r32`	Add r32 to r/m32.
`03 /r`	`ADD r16, r/m16`	Add r/m16 to r16.
`03 /r`	`ADD r32, r/m32`	Add r/m32 to r32.

Why do all the 16 and 32-bit versions of the same instruction have the same opcodes?

The answer is that the current mode determines the default operand size. In 16-bit mode, 16-bit registers are used by default; in 32-bit or 64-bit mode, the default operand size is 32 bits. To get the other size you have to add the 66h (operand-size override) prefix, and in 64-bit mode a REX.W prefix selects 64-bit operands. That means in 16-bit mode you'll get the output below instead of what you saw above:

83 c0 01           add ax, 1
66 83 c0 01        add eax, 1
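The prefix rule can be sketched in a few lines of Python. This is only an illustration: operand_size_prefix, encode_add_acc_imm8 and their parameter names are hypothetical, and the rule as written covers ordinary instructions like ADD, not byte-sized operations or instructions that default to 64 bits in long mode.

```python
def operand_size_prefix(mode_bits, operand_bits):
    """Return the prefix bytes that select operand_bits in the given CPU mode."""
    if operand_bits == 64:
        if mode_bits != 64:
            raise ValueError("64-bit operands are only available in 64-bit mode")
        return b"\x48"  # REX.W
    if operand_bits not in (16, 32):
        raise ValueError("this sketch only handles 16/32/64-bit operands")
    default = 16 if mode_bits == 16 else 32  # 32- and 64-bit modes default to 32
    # 66h flips between the 16-bit and 32-bit operand sizes
    return b"" if operand_bits == default else b"\x66"

def encode_add_acc_imm8(mode_bits, operand_bits, imm8):
    """Encode add ax/eax/rax, imm8 with 83 /0 ib (ModRM c0 = register 0)."""
    prefix = operand_size_prefix(mode_bits, operand_bits)
    return prefix + bytes([0x83, 0xC0, imm8 & 0xFF])

for mode, size in [(16, 16), (16, 32), (64, 16), (64, 32), (64, 64)]:
    print(mode, size, encode_add_acc_imm8(mode, size, 1).hex(" "))
```

The same instruction bytes 83 c0 01 therefore mean add ax, 1 in 16-bit mode and add eax, 1 in 32-bit or 64-bit mode, which is why an assembler has to know its target mode before it can emit anything.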

"I tried to disassemble 05 01 as shown in the manual but it didn't recognise it as an opcode just numbers"

That is because 05 must be followed by a 4-byte immediate (id/imm32 in the manual) or a 2-byte immediate (iw/imm16), depending on the effective operand size. Only instructions with imm8/ib take a single-byte immediate, so 05 01 is a truncated instruction. With enough immediate bytes, an online disassembler gives me the output below:

0:  05 01 02 03 04          add    eax,0x4030201
5:  66 05 01 02             add    ax,0x201
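To make the length rule concrete, here is a small Python sketch of a decoder for just this one opcode. decode_add_acc_imm is a hypothetical name, and the function only understands an optional 66 prefix followed by 05; it is not a general disassembler.

```python
def decode_add_acc_imm(code, default_bits=32):
    """Decode `[66] 05 imm` and return (text, length in bytes).

    default_bits is the mode's default operand size: 32 for 32/64-bit
    mode, 16 for 16-bit mode.
    """
    pos = 0
    operand_bits = default_bits
    if code[pos:pos + 1] == b"\x66":
        # The operand-size override switches to the non-default size
        operand_bits = 16 if default_bits == 32 else 32
        pos += 1
    if code[pos:pos + 1] != b"\x05":
        raise ValueError("not the 05 accumulator form")
    pos += 1
    imm_len = operand_bits // 8  # iw = 2 bytes, id = 4 bytes
    imm = code[pos:pos + imm_len]
    if len(imm) < imm_len:
        raise ValueError(
            f"truncated: 05 needs {imm_len} immediate bytes, got {len(imm)}"
        )
    reg = "ax" if operand_bits == 16 else "eax"
    value = int.from_bytes(imm, "little")
    return f"add {reg},{hex(value)}", pos + imm_len

print(decode_add_acc_imm(bytes.fromhex("0501020304")))  # ('add eax,0x4030201', 5)
print(decode_add_acc_imm(bytes.fromhex("66050102")))    # ('add ax,0x201', 4)
```

Feeding it bytes.fromhex("0501") raises ValueError, which mirrors a disassembler giving up and printing the raw bytes.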

As above, opcode 83h was chosen for add ax, 1 because 0x01 fits in a byte. For AX, however, the 83 and 05 forms come out the same length, so the assembler is free to pick either one:

0:  66 83 c0 01             add    ax,0x1
4:  66 05 01 00             add    ax,0x1
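One way an assembler could make this decision is to generate every valid encoding and keep the shortest. The Python sketch below does that for add ax, imm in 32-bit or 64-bit mode. add_ax_candidates is a hypothetical helper, and for simplicity it treats imm as a signed value when testing whether it fits in imm8.

```python
import struct

def add_ax_candidates(imm):
    """List valid encodings of `add ax, imm` as (form, bytes) pairs."""
    if not -0x8000 <= imm <= 0xFFFF:
        raise ValueError("immediate does not fit in 16 bits")
    out = []
    if -128 <= imm <= 127:
        # 83 /0 ib: ModRM c0 selects AX, imm8 is sign-extended to 16 bits
        out.append(("66 83 /0 ib", b"\x66\x83\xc0" + struct.pack("<b", imm)))
    imm16 = struct.pack("<H", imm & 0xFFFF)
    out.append(("66 05 iw", b"\x66\x05" + imm16))        # accumulator short form
    out.append(("66 81 /0 iw", b"\x66\x81\xc0" + imm16)) # general r/m16 form
    return out

for form, code in add_ax_candidates(1):
    print(f"{form:12} {len(code)} bytes  {code.hex(' ')}")

# min() keeps the first candidate on a tie, so listing order decides ties
shortest = min(add_ax_candidates(1), key=lambda c: len(c[1]))
```

For an immediate of 1 the first two candidates are both 4 bytes, so the tie is broken purely by listing order. That matches the observation that either output is legitimate.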
