assembly x86 machine-code opcode instruction-encoding

How to determine if ModR/M is needed through Opcodes?

I am reading about the IA-32 instruction format and found that the ModR/M byte is a single byte that is present only if required. How do I determine whether it is required? Some say it is determined by the opcode, but how exactly? I'd like to know the details. Are there any useful, authoritative documents that explain this?

Solution

Intel's vol.2 manual has details on the encoding of operands for each form of each instruction. For example, take just the 8-bit operand-size versions of the well-known add instruction: it has two reg, r/m forms (one per direction), an r/m, immediate form, and a no-ModRM 2-byte short form for add al, imm8:

Opcode    Instruction    | Op/En |  64-bit Mode | Compat/Leg Mode |  Description
04 ib     ADD AL, imm8   |  I    |   Valid           Valid         Add imm8 to AL.
80 /0 ib  ADD r/m8, imm8 |  MI   |   Valid           Valid         Add imm8 to r/m8.
00 /r     ADD r/m8, r8   |  MR   |   Valid           Valid         Add r8 to r/m8.
02 /r     ADD r8, r/m8   |  RM   |   Valid           Valid         Add r/m8 to r8.

And below that, the Instruction Operand Encoding table details what those I / MI / MR / RM codes from the Op/En (operand encoding) column above mean:

Op/En   | Operand 1        | Operand 2     | Operand 3  Operand 4
RM      | ModRM:reg (r, w) | ModRM:r/m (r) |  NA        NA
MR      | ModRM:r/m (r, w) | ModRM:reg (r) |  NA        NA
MI      | ModRM:r/m (r, w) | imm8/16/32    |  NA        NA
I       | AL/AX/EAX/RAX    | imm8/16/32    |  NA        NA

Notice that the "I" operand form doesn't mention a ModRM, so there isn't one. But MI does have one. (The ModRM reg field, which the /r forms use for a register operand, is instead filled in with the /0 from the 80 /0 in the opcode table, acting as extra opcode bits. The same applies to 83 /0 add r/m64, imm8, for example.)
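As a rough illustration of that lookup, here is a minimal Python sketch covering only the four 8-bit ADD encodings quoted above. The names ADD8_FORMS, OP_EN_USES_MODRM and needs_modrm are made up for this example. A real decoder would need a table for the whole opcode map, and 80 is shared with other mnemonics via /digit.

```python
# Hypothetical table: opcode byte -> (instruction form, Op/En code),
# copied from the four ADD rows of the opcode table above.
ADD8_FORMS = {
    0x04: ("ADD AL, imm8", "I"),
    0x80: ("ADD r/m8, imm8", "MI"),  # /0 selects ADD; other /digits are other mnemonics
    0x00: ("ADD r/m8, r8", "MR"),
    0x02: ("ADD r8, r/m8", "RM"),
}

# True when the Operand Encoding row for that Op/En names a ModRM field.
OP_EN_USES_MODRM = {"I": False, "MI": True, "MR": True, "RM": True}


def needs_modrm(opcode: int) -> bool:
    """Return True if this (8-bit ADD) opcode is followed by a ModRM byte."""
    _form, op_en = ADD8_FORMS[opcode]
    return OP_EN_USES_MODRM[op_en]


if __name__ == "__main__":
    for opcode, (form, op_en) in ADD8_FORMS.items():
        print(f"{opcode:02X}  {form:<16} {op_en:<3} ModRM: {needs_modrm(opcode)}")
```

The point is only that the opcode byte alone selects the row, and the row says whether a ModRM byte follows.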

Notice that RM and MR differ only in whether the r/m operand (that can be memory) is the destination or source.
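To make the direction difference concrete, the following Python sketch splits a ModRM byte into its mod / reg / rm fields and prints the operands for opcodes 00 and 02. It assumes register-direct operands only (mod = 3) and no REX prefix. The function names are hypothetical.

```python
# 8-bit register names selected by a 3-bit field when no REX prefix is present.
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]


def split_modrm(modrm: int) -> tuple[int, int, int]:
    """Split a ModRM byte into (mod, reg, rm): 2 + 3 + 3 bits."""
    return (modrm >> 6) & 0b11, (modrm >> 3) & 0b111, modrm & 0b111


def decode_add8_reg_reg(opcode: int, modrm: int) -> str:
    """Decode 00 /r or 02 /r with a register-direct r/m operand."""
    mod, reg, rm = split_modrm(modrm)
    if mod != 0b11:
        raise ValueError("memory operands are not handled in this sketch")
    if opcode == 0x00:    # MR: ADD r/m8, r8 -> r/m is the destination
        dst, src = REG8[rm], REG8[reg]
    elif opcode == 0x02:  # RM: ADD r8, r/m8 -> reg is the destination
        dst, src = REG8[reg], REG8[rm]
    else:
        raise ValueError("only opcodes 00 and 02 are handled here")
    return f"add {dst}, {src}"


if __name__ == "__main__":
    print(decode_add8_reg_reg(0x00, 0xD8))  # add al, bl
    print(decode_add8_reg_reg(0x02, 0xD8))  # add bl, al
```

The same ModRM byte D8 (mod=3, reg=3, rm=0) means opposite things under the two opcodes, which is all the MR/RM distinction amounts to.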

Most x86 ALU instructions have four reg, r/m opcodes: one for each direction (MR vs. RM) for each of 8-bit and non-8-bit. The size of the non-8-bit form is determined by prefixes: a 66 operand-size prefix flips between 16-bit and 32-bit, REX.W selects 64-bit, and with neither you get the default operand-size (which is 32-bit except in 16-bit modes).
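The prefix rules for the non-8-bit forms can be sketched in Python as below. This is a simplification: the mode argument stands for the default operand-size situation (16, 32 or 64-bit mode), and operand_size is a name invented for this example.

```python
def operand_size(mode: int, prefix_66: bool = False, rex_w: bool = False) -> int:
    """Effective operand size in bits for a non-8-bit ALU instruction form.

    mode is 16, 32 or 64; the default operand size is 16 in 16-bit mode,
    otherwise 32 (including 64-bit mode).
    """
    if mode not in (16, 32, 64):
        raise ValueError("mode must be 16, 32 or 64")
    if rex_w:
        if mode != 64:
            raise ValueError("REX prefixes exist only in 64-bit mode")
        return 64  # REX.W takes precedence over a 66 prefix
    default = 16 if mode == 16 else 32
    if prefix_66:
        return 32 if default == 16 else 16  # 66 flips between 16 and 32
    return default


if __name__ == "__main__":
    print(operand_size(64))                  # 32
    print(operand_size(64, prefix_66=True))  # 16
    print(operand_size(64, rex_w=True))      # 64
    print(operand_size(16, prefix_66=True))  # 32
```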

Plus the standard immediate form(s):

r/m8 with 8-bit immediate (sharing an opcode byte overloaded via /digit)
r/m 16/32/64-bit with 8-bit sign-extended immediate (sharing an opcode byte overloaded via /digit)
r/m 16/32/64-bit with 16-bit / 32-bit / sign-extended 32-bit immediate (sharing an opcode byte overloaded via /digit)
AL no modrm with 8-bit immediate (whole opcode byte to itself)
AX/EAX/RAX no modrm, imm16 / imm32 / sign_extended_imm32 (whole opcode byte to itself)
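For ADD specifically, those five immediate forms are opcodes 80, 83, 81, 04 and 05. The Python sketch below records whether each takes a ModRM byte and how many immediate bytes follow for a given operand size. The table and function names are hypothetical, and the other ALU mnemonics use different short-form opcodes.

```python
# Hypothetical table for ADD's immediate forms:
# opcode -> (has ModRM byte, immediate kind)
ADD_IMM_FORMS = {
    0x80: (True, "imm8"),       # ADD r/m8, imm8          (80 /0 ib)
    0x83: (True, "imm8"),       # ADD r/m16/32/64, imm8   (83 /0 ib, sign-extended)
    0x81: (True, "imm16/32"),   # ADD r/m16/32/64, imm16/32 (81 /0 iw or id)
    0x04: (False, "imm8"),      # ADD AL, imm8            (04 ib)
    0x05: (False, "imm16/32"),  # ADD AX/EAX/RAX, imm16/32 (05 iw or id)
}


def imm_bytes(opcode: int, opsize: int) -> int:
    """Number of immediate bytes following the opcode (and ModRM, if any).

    opsize is the effective operand size in bits; it only matters for the
    imm16/32 forms. A 64-bit operand size still uses a 4-byte immediate,
    which is sign-extended to 64 bits.
    """
    _has_modrm, kind = ADD_IMM_FORMS[opcode]
    if kind == "imm8":
        return 1
    return 2 if opsize == 16 else 4


if __name__ == "__main__":
    for opcode, (has_modrm, kind) in ADD_IMM_FORMS.items():
        print(f"{opcode:02X}  ModRM={has_modrm!s:<5} {kind:<9} "
              f"imm bytes at 32-bit opsize: {imm_bytes(opcode, 32)}")
```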

This is a lot of opcodes for every mnemonic, which is why 8086 didn't have room for more instructions following the same pattern as the usual ones. (See: Why are there no NAND, NOR and XNOR instructions in X86?)

See also https://wiki.osdev.org/X86-64_Instruction_Encoding which covers things more concisely than Intel's manual. Also note that you can check your understanding by assembling something with an assembler like NASM or GAS and looking at the machine code. Or just look at the disassembly of an existing program, e.g. objdump -drwC -Mintel /bin/ls | less

Some disassemblers even group bytes together in the machine code for each instruction, keeping a 4-byte immediate together as a group separate from opcode and modrm for example. (Agner Fog's objconv is like this.)