I am writing an x86-64 assembler. I was looking through the Intel x86 manual volume 2, trying to understand how to generate the correct instructions from the assembly. I mostly understand how it works but have been assembling and disassembling instructions to check if I have it correct.
In the ADD reference table (Vol 2A, 3.31):
Opcode | Instruction |
---|---|
04 ib | ADD AL, imm8 |
05 iw | ADD AX, imm16 |
05 id | ADD EAX, imm32 |
REX.W + 05 id | ADD RAX, imm32 |
Assemble:
;add.s
add al, 1
add ax, 1
add eax, 1
add rax, 1
Disassemble:
.text:
0: 04 01 add al, 1
2: 66 83 c0 01 add ax, 1
6: 83 c0 01 add eax, 1
9: 48 83 c0 01 add rax, 1
So the first one is correct, just like the manual says, but for the others the assembler uses encodings from further down the ADD reference table, like the ones with REX prefixes. Why does it use those rather than the ones I listed above?
Now the second one, add ax, 1: after searching I found out that 66 is the operand-size override prefix, but that is not listed in the ADD reference table. When do I need to add this prefix? I cannot find much information on it or the other legacy prefixes in the Intel manual.
I tried to disassemble 05 01 as shown in the manual, but the disassembler didn't recognise it as an opcode, just numbers. The Intel manual is a good resource, I think; it just lacks some extra explanation and structure. I am still trying to wrap my head around the ModRM stuff as well.
There are multiple opcodes for adding an immediate to a 64-bit register:
Opcode | Instruction | Description |
---|---|---|
REX.W + 05 id | ADD RAX, imm32 | Add imm32 sign-extended to 64-bits to RAX. |
REX.W + 81 /0 id | ADD r/m64, imm32 | Add imm32 sign-extended to 64-bits to r/m64. |
REX.W + 83 /0 ib | ADD r/m64, imm8 | Add sign-extended imm8 to r/m64. |
Because 01 fits in a byte, your assembler uses opcode 83 to save instruction length. If you try add rax, 100000000 or something similar, where the immediate no longer fits in a sign-extended byte, you'll get opcode 05.
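The selection rule above can be sketched in a few lines. This is only a minimal illustration, assuming the destination is always RAX; the function names (`fits_in_int8`, `fits_in_int32`, `encode_add_rax_imm`) are made up for this example and do not come from any real assembler.

```python
import struct

REX_W = 0x48  # REX prefix with W=1: 64-bit operand size


def fits_in_int8(value: int) -> bool:
    """True if value can be stored as a sign-extended 8-bit immediate."""
    return -128 <= value <= 127


def fits_in_int32(value: int) -> bool:
    """True if value can be stored as a sign-extended 32-bit immediate."""
    return -2**31 <= value <= 2**31 - 1


def encode_add_rax_imm(value: int) -> bytes:
    """Encode `add rax, value`, preferring the shortest form (hypothetical helper)."""
    if fits_in_int8(value):
        # REX.W + 83 /0 ib; ModRM 0xC0 = mod 11 (register), reg /0, rm 000 (RAX)
        return bytes([REX_W, 0x83, 0xC0]) + struct.pack("<b", value)
    if fits_in_int32(value):
        # REX.W + 05 id; no ModRM byte, so one byte shorter than REX.W + 81 /0 id
        return bytes([REX_W, 0x05]) + struct.pack("<i", value)
    raise ValueError("ADD has no imm64 form; the value must fit in 32 bits")


print(encode_add_rax_imm(1).hex(" "))
print(encode_add_rax_imm(100000000).hex(" "))
```

A real encoder would also have to handle other destination registers and memory operands, where only the 81 /0 and 83 /0 forms are available.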
To force a particular encoding instead of the shortest one, you'll need to define some syntax for it in your assembler. For example, NASM uses the strict keyword:
mov eax, 1               ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 1  ; 7 bytes: REX mov r/m64, sign-extended-imm32.
                         ; NASM optimizes mov rax,1 to the 5B version, but dword or strict dword stops it for some reason
mov rax, strict qword 1  ; 10 bytes
Now if you look at the table closely, you may see something "strange":
Opcode | Instruction | Description |
---|---|---|
05 iw | ADD AX, imm16 | Add imm16 to AX. |
05 id | ADD EAX, imm32 | Add imm32 to EAX. |
81 /0 iw | ADD r/m16, imm16 | Add imm16 to r/m16. |
81 /0 id | ADD r/m32, imm32 | Add imm32 to r/m32. |
01 /r | ADD r/m16, r16 | Add r16 to r/m16. |
01 /r | ADD r/m32, r32 | Add r32 to r/m32. |
03 /r | ADD r16, r/m16 | Add r/m16 to r16. |
03 /r | ADD r32, r/m32 | Add r/m32 to r32. |
Why do all the 16 and 32-bit versions of the same instruction have the same opcodes?
The answer is that the current mode defines the default operand size. In 16-bit mode, 16-bit registers are used by default; in 32-bit or 64-bit mode, the default operand size is 32 bits. To use the other size, you have to add the 66h (operand-size override) prefix. So the same opcode byte serves both sizes, and whether 66h is needed depends on the mode you are assembling for. That means in 16-bit mode you'll get the output below instead of what you saw above:
83 c0 01 add ax, 1
66 83 c0 01 add eax, 1
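The prefix decision described above can be written down as a small helper. This is a sketch under stated assumptions: it only covers 16-bit and 32-bit operands (64-bit operands use REX.W instead and are left out), and the names `operand_size_prefix` and `encode_add_acc_imm8` are invented for illustration.

```python
def operand_size_prefix(mode_bits: int, operand_bits: int) -> bytes:
    """Return b"\\x66" when the operand size differs from the mode's default."""
    if mode_bits not in (16, 32, 64):
        raise ValueError("mode must be 16, 32 or 64")
    if operand_bits not in (16, 32):
        raise ValueError("this sketch only handles 16- and 32-bit operands")
    # Default operand size: 16 bits in 16-bit mode, 32 bits in 32- and 64-bit mode
    default_bits = 16 if mode_bits == 16 else 32
    return b"" if operand_bits == default_bits else b"\x66"


def encode_add_acc_imm8(mode_bits: int, operand_bits: int, imm: int) -> bytes:
    """Encode `add ax, imm` or `add eax, imm` with the 83 /0 ib form."""
    if not -128 <= imm <= 127:
        raise ValueError("immediate does not fit in a sign-extended byte")
    # ModRM 0xC0 = mod 11 (register), reg /0, rm 000 (AX/EAX)
    body = bytes([0x83, 0xC0, imm & 0xFF])
    return operand_size_prefix(mode_bits, operand_bits) + body


for mode in (16, 64):
    for size, reg in ((16, "ax"), (32, "eax")):
        print(f"{mode}-bit mode: add {reg}, 1 ->",
              encode_add_acc_imm8(mode, size, 1).hex(" "))
```

Keeping the prefix choice separate from the opcode choice mirrors the manual's layout: the opcode table gives the opcode bytes, and the prefix follows from the target mode.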
> I tried to disassemble 05 01 as shown in the manual but it didn't recognise it as an opcode just numbers

That is because 05 must be followed by a 4-byte immediate (id/imm32, as indicated in the manual) or a 2-byte immediate (iw/imm16), depending on the operand size in effect. Only instructions with imm8/ib can have a single-byte immediate, so 05 01 on its own is an incomplete instruction. For example, the online disassembler gives me the output below:
0: 05 01 02 03 04 add eax,0x4030201
5: 66 05 01 02 add ax,0x201
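To make the length rule concrete, here is a rough sketch of how a decoder might handle opcode 05. It is an illustration only: it knows nothing beyond an optional 66h prefix and opcode 05 (REX prefixes are ignored), and `decode_add_acc_imm` is a hypothetical name.

```python
from typing import Tuple


def decode_add_acc_imm(code: bytes, mode_bits: int = 64) -> Tuple[str, int]:
    """Decode one `05 iw` / `05 id` instruction; return (text, length in bytes)."""
    default_bits = 16 if mode_bits == 16 else 32
    operand_bits = default_bits
    pos = 0
    if code[pos:pos + 1] == b"\x66":
        # The operand-size override toggles between 16 and 32 bits
        operand_bits = 16 if default_bits == 32 else 32
        pos += 1
    if code[pos:pos + 1] != b"\x05":
        raise ValueError("this sketch only understands opcode 05")
    pos += 1
    imm_len = operand_bits // 8  # 2 bytes for iw, 4 bytes for id
    imm = code[pos:pos + imm_len]
    if len(imm) < imm_len:
        raise ValueError(
            f"truncated: opcode 05 needs a {imm_len}-byte immediate, got {len(imm)}"
        )
    reg = "ax" if operand_bits == 16 else "eax"
    value = int.from_bytes(imm, "little")
    return f"add {reg},{value:#x}", pos + imm_len


print(decode_add_acc_imm(bytes.fromhex("05 01 02 03 04")))
print(decode_add_acc_imm(bytes.fromhex("66 05 01 02")))
try:
    decode_add_acc_imm(bytes.fromhex("05 01"))
except ValueError as err:
    print("05 01 ->", err)
```

Fed only 05 01, the sketch reports a truncated instruction, which is the same situation the disassembler was in when it printed raw bytes.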
As with add rax, 1, your assembler chose opcode 83h for add ax, 1 because 0x01 fits in a byte. In this case, though, the 66 05 iw form is exactly as long (4 bytes), so both encodings have the same length and the assembler can choose whichever it likes:
0: 66 83 c0 01 add ax,0x1
4: 66 05 01 00 add ax,0x1
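One way an assembler could reason about this tie is to generate every valid encoding and compare lengths. The sketch below assumes 32-bit or 64-bit mode (so the 66h prefix is required for AX) and uses a made-up function name, `add_ax_imm_candidates`.

```python
import struct


def add_ax_imm_candidates(value: int) -> dict:
    """Return all encodings of `add ax, value` in 32/64-bit mode, keyed by form."""
    if not -0x8000 <= value <= 0xFFFF:
        raise ValueError("immediate does not fit in 16 bits")
    unsigned = value & 0xFFFF
    # Interpret the 16-bit pattern as signed to test the imm8 form
    signed = unsigned - 0x10000 if unsigned >= 0x8000 else unsigned
    imm16 = struct.pack("<H", unsigned)
    candidates = {
        "66 05 iw": b"\x66\x05" + imm16,         # accumulator form, no ModRM
        "66 81 /0 iw": b"\x66\x81\xC0" + imm16,  # generic r/m16, imm16
    }
    if -128 <= signed <= 127:
        # imm8 is sign-extended to 16 bits by the CPU
        candidates["66 83 /0 ib"] = b"\x66\x83\xC0" + struct.pack("<b", signed)
    return candidates


for form, encoding in add_ax_imm_candidates(1).items():
    print(f"{form:12} {len(encoding)} bytes: {encoding.hex(' ')}")
```

For an immediate of 1, the 83 and 05 forms both come out at 4 bytes, while the 81 form is 5 bytes, so either of the first two is a reasonable choice.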
You may want to read this