
Simple instruction encode


Let's take the following assembly instruction:

add    %cl,%bl

This gets encoded as: 00 cb, or 00000000 11001011 in binary. Putting the cb into the ModR/M bitfields, it looks like:

  1   1   0   0   1   0   1   1
+---+---+---+---+---+---+---+---+
|  mod  |    reg    |    r/m    |
+---+---+---+---+---+---+---+---+

And, looking up the register fields, we get:

  • mod: 11 (Register addressing mode)
  • reg: 001 (cl register)
  • r/m: 011 (bl register)
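The field split above can be checked mechanically. Here is a minimal Python sketch, assuming the standard ModR/M layout (mod in bits 7-6, reg in bits 5-3, r/m in bits 2-0) and the 8-bit register numbering that applies when no REX prefix is present. The names `split_modrm` and `REG8` are made up for illustration.

```python
# 8-bit register names indexed by their 3-bit encoding (assumes no REX prefix).
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]

def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11   # top two bits: addressing mode
    reg = (byte >> 3) & 0b111  # middle three bits: register operand
    rm = byte & 0b111          # low three bits: register-or-memory operand
    return mod, reg, rm

mod, reg, rm = split_modrm(0xCB)
print(f"mod={mod:02b} reg={reg:03b} ({REG8[reg]}) rm={rm:03b} ({REG8[rm]})")
```

Note that the register names only apply to the r/m field when mod is 11; for other mod values r/m selects a memory addressing form instead.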

And I believe 000000ds is the add instruction, and d=s=0 since they're all registers. Is that a correct understanding of how this instruction is encoded? Additionally, for the 'full encoding' scheme, would the following be accurate (in bytes, not bits):

[empty]         0x0         0b11001011     [empty]        [empty]          [empty]
_ _ _ _        _ _             _              _           _ _ _ _          _ _ _ _
Prefix      Instruction    Mod-reg-r/m      Scale       displacement      immediate

Are there any things I'm missing here in my attempt at 'decoding' the instruction?


Solution

  • Yes, looks right.

    The general pattern (for "legacy" ALU instructions that date back to 8086) for encoding op r/m, r vs. op r, r/m, and 8-bit vs. 16/32 bit does use the low 2 bits of the opcode byte in a regular pattern, but there's no need to rely on that.
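As a rough illustration of that low-two-bits pattern, here is a Python sketch for the four register/memory forms of add (opcodes 00 through 03). It assumes bit 1 is the direction bit (the question's d) and bit 0 selects the operand size (the question's s, more commonly labelled w). The helper name `describe_alu_opcode` is hypothetical, and the manual's per-opcode tables remain the authoritative source.

```python
def describe_alu_opcode(opcode):
    """Describe the operand form implied by the low two bits of a legacy ALU opcode."""
    wide = opcode & 0b01         # 0: 8-bit operands, 1: 16/32-bit operands
    reg_is_dest = opcode & 0b10  # 0: op r/m, r    1: op r, r/m
    size = "16/32" if wide else "8"
    if reg_is_dest:
        return f"r{size}, r/m{size}"
    return f"r/m{size}, r{size}"

# ADD occupies opcodes 00..03 for its ModRM forms (Intel operand order).
for opcode in range(0x00, 0x04):
    print(f"{opcode:02x} /r  add {describe_alu_opcode(opcode)}")
```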

    Intel does fully document exactly what's going on for each encoding of each instruction in their vol.2 manual. See the Op/En column and Operand Encoding table for add for example. (See also https://ref.x86asm.net/coder64.htm which also specifies which operand is which for every opcode). These both let you know which opcodes take a ModRM byte and which don't.

    These of course use Intel-syntax order. You're making your life more complicated by trying to follow manuals and tutorials while using AT&T syntax which reverses the order of the operand-list vs. Intel and AMD manuals.

    e.g. 00 /r is listed as MR operand encoding, which from the table we can see is operand 1 = ModRM:r/m (r, w), so it's read and written, and encoded by the r/m field. operand 2 = ModRM:reg (r), so it's a read-only source encoded by the reg field.

    Fun fact: 00 00 is add [rax], al, or AT&T add %al, (%rax)

    Note that you can ask GAS to pick either encoding (see "x86 XOR opcode differences"):

    {load}  add    %cl,%bl        # 02 d9
    {store} add    %cl,%bl        # 00 cb
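To see where the two byte sequences come from, here is a small Python sketch that builds both encodings by hand. It only handles 8-bit register-to-register add with no prefixes, and the function name `encode_add_r8` and its `form` parameter are invented for this example (they just mirror the GAS pseudo-prefix names).

```python
# 3-bit encodings of the 8-bit registers (assumes no REX prefix).
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}

def encode_add_r8(dst, src, form):
    """Encode Intel-order `add dst, src` for two 8-bit registers.

    form="store": opcode 00 /r (ADD r/m8, r8), dst in r/m, src in reg.
    form="load":  opcode 02 /r (ADD r8, r/m8), dst in reg, src in r/m.
    """
    if form == "store":
        opcode, reg, rm = 0x00, REG8[src], REG8[dst]
    elif form == "load":
        opcode, reg, rm = 0x02, REG8[dst], REG8[src]
    else:
        raise ValueError("form must be 'store' or 'load'")
    modrm = (0b11 << 6) | (reg << 3) | rm  # mod=11: register-direct operand
    return bytes([opcode, modrm])

# AT&T `add %cl,%bl` is Intel `add bl, cl`: destination bl, source cl.
print(encode_add_r8("bl", "cl", "store").hex(" "))  # 00 cb
print(encode_add_r8("bl", "cl", "load").hex(" "))   # 02 d9
```

Both byte sequences perform the same operation; the only difference is which ModRM field holds the destination.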
    

    See also Difference between MOV r/m8,r8 and MOV r8,r/m8