
Opcode vs Operand in x86 assembly source code


Recently in an exam, when asked about opcode vs operand, I gave an example

mov [ax],0000h

where I said that mov was the opcode and [ax],0000h were the operands, and that together they formed an instruction. My instructor gave me a 0 on the question and said that mov [ax] was the opcode and only 0000h was the operand.

In my textbook it says that in a MOV instruction the mov is the opcode and the source and destination are often called operands.

I wish to go to my instructor with the textbook and ask again, but before I do, can someone clear this up for me so I am not going in with any wrong understanding?

Tried to write correct answer, got 0 lol.


Solution

  • First of all, mov [ax], 0000h can't be represented in 8086 machine code. There's no binary representation for that destination addressing mode.

    TL;DR: mov is the mnemonic, [ax] is the destination operand, 0000h is the source operand. There is no binary "opcode" because the instruction is not encodable. But if you're misusing "opcode" to talk about parts of the source line, you'd normally say that mov is the opcode.


    Opcodes are a feature of machine code, not assembly source code. Perhaps they're bending the terminology to talk about the instruction name, or they intended to talk about how it will assemble into machine code.

    In the asm source code mov [ax],0000h:

    • mov is the mnemonic, which says what instruction it is. This means the machine code will use an opcode that's one of the few listed in the manual for that mnemonic (https://www.felixcloutier.com/x86/mov), with the assembler's choice depending on the operands.

      In this case there is a memory destination and an immediate source, but the operand size is not specified or implied by either, so the opcode could be C6 /0 ib MOV r/m8, imm8 or C7 /0 iw MOV r/m16, imm16. emu8086 is a bad assembler that doesn't warn you about the ambiguity in some cases, but it might here, where the value is zero.

    • [ax] is the destination operand. This is not encodable in x86 machine code; it's not one of the few valid 16-bit addressing modes.

    • 0000h is the source operand. Most instructions have an opcode that allows an immediate source.
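
    To make the point about [ax] concrete, here is a minimal Python sketch of the r/m field lookup for 16-bit addressing modes. The table contents follow the standard 16-bit ModRM addressing table; the names RM_16BIT and rm_field_for are made up for this illustration and are not part of any assembler.

```python
# The eight base/index combinations selectable by the 3-bit r/m field of a
# ModRM byte in 16-bit addressing. (With mod = 00, r/m = 110 means a bare
# 16-bit displacement instead of [bp]; that special case is ignored here.)
RM_16BIT = {
    0b000: ("bx", "si"),
    0b001: ("bx", "di"),
    0b010: ("bp", "si"),
    0b011: ("bp", "di"),
    0b100: ("si",),
    0b101: ("di",),
    0b110: ("bp",),
    0b111: ("bx",),
}

def rm_field_for(*regs):
    """Return the r/m value for a set of address registers, or None if no
    16-bit addressing mode uses exactly those registers."""
    wanted = frozenset(regs)
    for rm, combo in RM_16BIT.items():
        if frozenset(combo) == wanted:
            return rm
    return None

print(rm_field_for("si"))        # 4
print(rm_field_for("bx", "di"))  # 1
print(rm_field_for("ax"))        # None: [ax] has no encoding
```

    Only bx, bp, si and di ever appear in the table, which is why [ax], [cx] and [dx] cannot be written as 16-bit memory operands.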

    Unlike on some earlier 8-bit machines, like the 8080 that influenced some 8086 design decisions, both operands are explicit for most instructions, not just implied by an opcode. (Later extensions to x86 include some instructions with more than 2 operands, but x86 is still mostly a 2-operand ISA.)

    For comparison, see an 8080 opcode map https://pastraiser.com/cpu/i8080/i8080_opcodes.html
    vs. an 8086 opcode map like this, or a table like this. (Or a modern x86 32-bit mode opcode table, http://ref.x86asm.net/coder32.html which is the most nicely formatted and readable.) Note that in the 8080 map, each entry has at least a destination or both operands implied just by the opcode byte. But in the 8086 map, an entry usually implies only the mnemonic, with the operands encoded separately.


    So there's no combination of opcode and ModRM byte that can represent this instruction as a sequence of bytes of machine code.

    See How to tell the length of an x86 instruction? for a diagram summarizing the format of x86 machine code. (8086 didn't allow a SIB byte, hence the more limited addressing modes, but all other optional parts are still applicable. 8086 only has 1-byte opcodes, never 2 or 3, and of course immediates and displacements are at most 2 bytes.)

    If it was mov word ptr [si], 0000h, the machine code would be

             c7     04       00 00 
             ^      ^        ^
           opcode  ModR/M   imm16 immediate operand 
    

    The destination operand, [si], is encoded by the ModRM byte, using the 2-bit "mode" field (0) that specifies a memory addressing mode with no displacement (since it's not [si + 16] or something), and the 3-bit "r/m" field that specifies just si. See the table in https://wiki.osdev.org/X86-64_Instruction_Encoding#16-bit_addressing or in Intel or AMD's manuals.

    The opcode is the c7 byte plus the 3-bit reg field of the ModRM byte (with value 0, the /0 in C7 /0). See How to read the Intel Opcode notation for details on how this works, borrowing the reg field as extra opcode bits. (That's why we have instructions like add ax, 123, not add cx, [si], 123 with a write-only destination and two separate sources including the immediate implied by the opcode, since ModRM can normally encode two operands as in add cx, [si]. Only the imul cx, [si], 123 form, new in 186, allows that. Similarly, neg dx instead of neg cx, dx.)
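
    As a rough illustration of how those fields pack together, here is a small Python sketch that builds the bytes for mov word ptr [si], imm16. The helper names modrm and encode_mov_mem_si_imm16 are hypothetical, chosen only for this example; the field values are the ones described above.

```python
def modrm(mod, reg, rm):
    """Pack the three ModRM fields: mod (2 bits), reg (3 bits), r/m (3 bits)."""
    assert 0 <= mod < 4 and 0 <= reg < 8 and 0 <= rm < 8
    return (mod << 6) | (reg << 3) | rm

def encode_mov_mem_si_imm16(imm):
    """Encode mov word ptr [si], imm16 using the C7 /0 iw form."""
    opcode = 0xC7
    # mod = 00: memory operand, no displacement
    # reg = 0:  the /0 opcode-extension bits
    # r/m = 100: [si]
    second = modrm(mod=0b00, reg=0, rm=0b100)
    # 16-bit immediates are stored little-endian (low byte first).
    return bytes([opcode, second]) + (imm & 0xFFFF).to_bytes(2, "little")

print(encode_mov_mem_si_imm16(0x0000).hex(" "))  # c7 04 00 00
print(encode_mov_mem_si_imm16(0x1234).hex(" "))  # c7 04 34 12
```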


    If it was mov ax, 0000h

       b8          00 00
        ^          ^
      Opcode       imm16 immediate source
    

    The AX destination is specified by the low 3 bits of the leading byte. You could look at this as 8 different opcode bytes, one for each register, with an implicit destination. That interpretation (of this different instruction, not the impossible one in your assignment) would sort of match up with your instructor's description of "mov-to-AX" as the opcode.

    But really you'd say mov ax, imm16 was the opcode, with the actual value to fill in the placeholder being the 0 operand. There are three other opcodes that can mov to AX:

    • 8B /r mov r16, r/m16 (example: mov ax, [si])
    • 89 /r mov r/m16, r16 (example: mov ax, si)
    • A1 mov ax, moffs (e.g. mov ax, [1234h]). Special case no-ModRM short-form with an absolute offset and an AL or AX destination.
    • And a 4th that wouldn't normally get used with a register destination because it's longer: C7 /0 iw mov r/m16, imm16 (e.g. a longer encoding of mov ax, 0).
    • Also 8C /r mov r/m16, Sreg (e.g. mov ax, ds).
    • Modern x86 has a few more forms, like mov r32, cr0..7 (new in 386) and mov r32, dr0..7 (386), which write a full 32-bit register such as EAX rather than just AX, but control registers didn't exist(?) until 286 smsw (store machine status word).
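
    For reference, the 8086 forms in the list above can be worked out by hand. The following Python sketch does so under the usual register numbering (ax=0 ... di=7, es=0, cs=1, ss=2, ds=3); the dictionary and helper names are illustrative only.

```python
def modrm(mod, reg, rm):
    """Pack mod (2 bits), reg (3 bits) and r/m (3 bits) into one byte."""
    return (mod << 6) | (reg << 3) | rm

REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}
SREG = {"es": 0, "cs": 1, "ss": 2, "ds": 3}
RM_MEM_SI = 0b100   # r/m value for [si] when mod = 00
MOD_REG = 0b11      # mod = 11: the r/m field names a register, not memory

encodings = {
    # 8B /r: reg field = destination register, r/m = source
    "mov ax, [si]":    bytes([0x8B, modrm(0b00, REG16["ax"], RM_MEM_SI)]),
    # 89 /r: r/m = destination, reg field = source register
    "mov ax, si":      bytes([0x89, modrm(MOD_REG, REG16["si"], REG16["ax"])]),
    # A1: no ModRM byte, just a 16-bit absolute offset
    "mov ax, [1234h]": bytes([0xA1]) + (0x1234).to_bytes(2, "little"),
    # 8C /r: reg field = segment register, r/m = destination
    "mov ax, ds":      bytes([0x8C, modrm(MOD_REG, SREG["ds"], REG16["ax"])]),
}

for asm, code in encodings.items():
    print(f"{asm:<16} {code.hex(' ')}")
```

    In every case the AX destination lives in a different place: the reg field, the r/m field, or nowhere at all because the opcode byte itself implies it.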

    Or you could look at it the way Intel's manual documents it, as B8+ rw iw being the encoding for MOV r16, imm16. So the opcode is the high 5 bits of the first byte, and the destination register number is the low 3 bits of that byte. As with the memory-destination form, the opcode itself implies the presence of a 16-bit immediate as the source operand.
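
    A short Python sketch of the B8+ rw iw view, assuming the standard 16-bit register numbering; encode_mov_r16_imm16 and split_first_byte are hypothetical helper names used only here.

```python
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def encode_mov_r16_imm16(reg, imm):
    """Encode mov r16, imm16 as B8+rw iw: register number in the low 3 bits."""
    first = 0xB8 | REG16[reg]
    return bytes([first]) + (imm & 0xFFFF).to_bytes(2, "little")

def split_first_byte(byte):
    """Split the leading byte into (high 5 opcode bits, low 3 register bits)."""
    return byte >> 3, byte & 0b111

print(encode_mov_r16_imm16("ax", 0).hex(" "))       # b8 00 00
print(encode_mov_r16_imm16("bx", 0x1234).hex(" "))  # bb 34 12
print(split_first_byte(0xBB))                       # (23, 3): 0b10111, bx
```

    All eight bytes B8 through BF share the same high 5 bits, which is what lets you read them either as one opcode with a register field or as eight opcodes with implicit destinations.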

    There's no ModR/M byte; the purpose of these short-form encodings was to save space for common instructions in 8086. There are similar no-ModRM short forms for xchg-with-AX (which is where 90h nop comes from: xchg ax,ax) and for inc/dec of a full register. There are also no-ModRM short forms for most ALU operations with the accumulator, e.g. add al, 123 is 2 bytes, vs. add bl, 123 is 3 bytes. (See code golf tips for x86 machine code).

    Note that mov ax, 0 is also encodeable with a 4-byte encoding, using the same mov r/m16, imm16 encoding, with a ModRM byte encoding the ax register as the destination. Assemblers normally choose the shortest possible encoding when there's a choice. (In some cases there are two choices the same length, like add cx, dx: see x86 XOR opcode differences)
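
    The two encodings of mov ax, 0 can be compared side by side. This is only a sketch of the "pick the shortest" idea, not how any particular assembler is implemented; the names below are invented for the example.

```python
def modrm(mod, reg, rm):
    """Pack mod (2 bits), reg (3 bits) and r/m (3 bits) into one byte."""
    return (mod << 6) | (reg << 3) | rm

AX = 0                                   # register number of ax
imm16 = (0).to_bytes(2, "little")        # the immediate 0000h

candidates = {
    # Short form: register number folded into the opcode byte, no ModRM.
    "B8+rw iw": bytes([0xB8 | AX]) + imm16,
    # General form: mod = 11 selects a register, reg = /0, r/m = ax.
    "C7 /0 iw": bytes([0xC7, modrm(0b11, 0, AX)]) + imm16,
}

for form, code in candidates.items():
    print(f"{form}: {code.hex(' ')} ({len(code)} bytes)")

# Choose the form with the fewest bytes, as assemblers normally do.
shortest = min(candidates, key=lambda form: len(candidates[form]))
print("shortest:", shortest)
```

    Both byte sequences make the CPU do the same thing; the only difference is one byte of code size.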