Tags: assembly, x86, x86-16, tasm, emu8086

Key Differences Between `mov byte ptr` and `mov word ptr`


I've been reading past posts on

  1. What ptr does?
  2. What [] does?

in assembly, but I'm still struggling to understand the following code:

    Title : Program failed to comprehend
    .model small
    .stack 100h
    .data
         Msg db 10,13, 'Zahid. $'   
    .code
    .startup
         ; Initialising data segment
          mov ax, @data
          mov ds, ax
        ;Before operation displaying message
         mov dx, offset msg
         mov ah,09h
         int 21h

          mov msg , 'A'          ; Writing on memory specified by msg thats OK -> A
          mov msg+1 , 'R'     ; Confusion as writing on memory specified by msg then add 1(not 8bits next address write) -> A
          mov [msg]+2, 'I'     ; confusion: Writing on memory specified by msg then add 2 value to that address write-> I
          mov byte ptr [msg+3] , 'F'      ; Ok with me, writing on byte memory specified by msg+3 -> F 
          mov word ptr [msg + 4], '.'      ; Confused again as writing on word memory specified by msg+4, '.' Will it write ascii on the two bytes-> .
          mov msg[5] , '$'   ; Not confused on this.
  
       ;Print var
        mov dx, offset msg
        mov ah,09h
        int 21h
     
        ;Exit Code.
        mov ah,04ch
        xor al,al
        int 21h
         
    Ends

Output:
Zahid. ARIF.

Please explain the operation as I believe it should not print 'ARIF'??


Solution

  • In assembly, the syntax depends on the particular assembler. Emu8086 mostly follows the MASM dialect, which is quite relaxed in its rules and allows several different spellings of the same operand (all producing the same output).

    If you are used to a high-level programming language, it may be confusing that the syntax is not set in stone, and you may wonder how to live with this mess in asm.

    But for an asm programmer this is rarely an issue, because in assembly you don't build a runtime expression out of operators and values. Each instruction in the source usually maps 1:1 to a single CPU instruction, with exactly the arguments and options that this instruction has in the CPU.

    The MOV instruction on x86 is a bit of a mess itself, as the single mnemonic "MOV" covers many different instruction opcodes. Your example uses only two of them: MOV r/m8,imm8 with opcode C6 for storing byte values, and MOV r/m16,imm16 with opcode C7 for storing a word value. In all cases the r/m part is a memory reference by absolute offset, which is calculated at assembly time.

    So if msg is the symbol for memory address 0x1000, then the lines in your question assemble as:

    ; machine code  | disassembled instruction from machine code
    
    C606001041        mov byte [0x1000],0x41
    

    This stores the byte value 0x41 ('A') into memory at address ds:0x1000. C6 is the MOV r/m8,imm8 opcode, and 06 is the ModR/M byte that selects a direct 16-bit offset as the memory operand. The bytes 00 10 are the offset 0x1000 itself (little endian), and finally 41 is the imm8 value 0x41. Segment ds is used by default to calculate the full physical memory address, because there's no segment override prefix ahead of the instruction.
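    To make that byte layout concrete, here is a minimal Python sketch that decodes only the direct-address form of these two MOV encodings. The function name decode_mov_imm_direct is hypothetical, and the sketch deliberately ignores every other ModR/M form and any prefixes.

```python
def decode_mov_imm_direct(code: bytes):
    """Decode MOV [offset16], imm8/imm16 (opcodes C6 and C7) in 16-bit mode.

    Illustrative only: handles just the direct-address form, no prefixes.
    """
    opcode, modrm = code[0], code[1]
    if opcode not in (0xC6, 0xC7):
        raise ValueError("not a MOV r/m, imm instruction")
    if modrm != 0x06:
        # mod=00, reg=000, r/m=110 means "memory operand is a bare 16-bit offset"
        raise ValueError("only the direct-address form is handled here")
    size = 1 if opcode == 0xC6 else 2           # C6 stores a byte, C7 a word
    if len(code) != 4 + size:
        raise ValueError("unexpected instruction length")
    offset = int.from_bytes(code[2:4], "little")      # 00 10 -> 0x1000
    imm = int.from_bytes(code[4:4 + size], "little")  # 2E 00 -> 0x002E
    return ("byte" if size == 1 else "word", offset, imm)


for hex_bytes in ("C606001041", "C70604102E00"):
    size, offset, imm = decode_mov_imm_direct(bytes.fromhex(hex_bytes))
    print(f"{hex_bytes:<14} mov {size} [0x{offset:04x}],0x{imm:x}")
```

    The only thing that differs between the byte and the word store is the opcode and the width of the trailing immediate; the address part is encoded identically.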

    C606011052        mov byte [0x1001],0x52
    C606021049        mov byte [0x1002],0x49
    C606031046        mov byte [0x1003],0x46
    C70604102E00      mov word [0x1004],0x2e
    C606051024        mov byte [0x1005],0x24
    

    The remaining lines are the same story: they write byte values at specific memory offsets, going byte by byte through memory and overwriting each one.

    The subtle difference is mov word ptr [msg + 4], '.', which targets memory address ds:0x1004 just like the other lines, but the value stored is an imm16, i.e. a word value equal to 0x002E ('.'). So the different opcode C7 is used, and the immediate value needs two bytes, 2E 00. This instruction overwrites memory at address ds:0x1004 with byte 2E and at ds:0x1005 with byte 00.

    So if the memory at address msg (ds:0x1000 in my examples) initially contained:

    0x1000: 0A 0D 5A 61 68 69 64 2E 20 24  ; "\n\rZahid. $"
    

    It will change to this after each MOV is executed:

    0x1000: 41 0D 5A 61 68 69 64 2E 20 24  ; "A\rZahid. $"
    0x1000: 41 52 5A 61 68 69 64 2E 20 24  ; "ARZahid. $"
    0x1000: 41 52 49 61 68 69 64 2E 20 24  ; "ARIahid. $"
    0x1000: 41 52 49 46 68 69 64 2E 20 24  ; "ARIFhid. $"
    0x1000: 41 52 49 46 2E 00 64 2E 20 24  ; "ARIF.\0d. $"
    

    That word store overwrote two bytes: 'h' (with the dot) and 'i' (with the zero).

    0x1000: 41 52 49 46 2E 24 64 2E 20 24  ; "ARIF.$d. $"
    

    And that zero is then overwritten once more, this time with a dollar sign (the string terminator for the DOS int 21h service ah=9).
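    The whole sequence can be replayed outside the emulator. The following Python sketch models the data segment as a bytearray, assuming msg sits at the example offset 0x1000; the helpers store and dos_string are hypothetical names, and dos_string only imitates how int 21h / ah=9 reads up to the '$' terminator.

```python
MSG = 0x1000                      # assumed offset of msg, as in the example above
memory = bytearray(0x10000)       # one 64 KiB segment, zero-filled
initial = bytes([10, 13]) + b"Zahid. $"
memory[MSG:MSG + len(initial)] = initial


def store(offset, value, size):
    """Write `value` as `size` little-endian bytes, like MOV r/m, imm."""
    memory[offset:offset + size] = value.to_bytes(size, "little")


def dos_string(offset):
    """Return the text a '$'-terminated print would emit from `offset`."""
    end = memory.index(b"$", offset)
    return memory[offset:end].decode("ascii")


# (displacement from msg, character, operand size in bytes), in source order
stores = [(0, "A", 1), (1, "R", 1), (2, "I", 1),
          (3, "F", 1), (4, ".", 2), (5, "$", 1)]

before = dos_string(MSG)
for delta, ch, size in stores:
    store(MSG + delta, ord(ch), size)
    print(memory[MSG:MSG + 10].hex(" ").upper())   # memory after each MOV
after = dos_string(MSG)

print(repr(before), repr(after))
```

    The first print call sees the original text and the second stops at the '$' written to msg+5, which is why the combined output reads "Zahid. ARIF.".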

    Generally the relaxed syntax is not a problem, because you can't invent your own instruction: the assembler works out which of the existing ones fits and assembles whatever expression you wrote into it. There's no x86 instruction like mov [address1] and [address2], value that stores the same value at two different memory locations, nor a mov [address]+2 that would add two to the memory value at address (that is possible with add [address], 2, which is one of the add r/m,imm variants, depending on the data size).

    So mov msg+1,... can only mean the memory address msg + 1; there's no other meaningful possibility in the x86 instruction set. The byte data size is deduced from the db directive used after the label msg. This is a speciality of the MASM and emu8086 assemblers; most other assemblers don't associate a defined label (symbol) with the directive that follows it, i.e. symbols have no "types" in common assemblers. In those, mov msg+1,'R' may end with a syntax error, not because the left side is problematic, but because the assembler doesn't know how big the 'R' value should be (how many bytes).

    My personal favourite, NASM, would report another error here, as it requires brackets around every memory access. In NASM only mov [msg+2],... would be valid (with a size modifier like MASM's "byte ptr" allowed, but without the "ptr": mov byte [msg+2],...). But in MASM/emu8086 all the variants you used are valid syntax with the same meaning, producing a memory reference by 16-bit offset.

    The assembler will also not produce two instructions instead of a single one (the exception may be special "pseudo-instructions" in some assemblers, which are assembled into several native instructions, but that is not common in x86 assembly).

    Once you know the target CPU's instruction set and which instructions exist, you will easily be able to guess from the vague syntax which target instruction will be produced.

    Or you can simply check the debugger's disassembly window, as the disassembler uses only a single syntax for each instruction and is not aware of the source formatting ambiguities.

     mov word ptr [msg + 4], '.'
       ; Confused again as writing on word memory specified by msg+4,
       ; '.' Will it write ascii on the two bytes-> .
    

    It will write two bytes; that's what WORD PTR specifies in MASM. The value is only '.' = 0x2E, but 0x2E is perfectly valid as a 16-bit value too: it is simply extended with zeroes to 0x002E, and that's the value the assembler uses for this line.
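    To see what the size specifier changes in isolation, this small Python sketch applies the same value '.' at index 4 of the original msg bytes, once as a byte and once as a word. The helper name store_copy is hypothetical, and the buffer is just a copy of the ten bytes defined by the db line.

```python
original = bytes([10, 13]) + b"Zahid. $"   # the ten bytes defined at msg


def store_copy(buf, index, value, size):
    """Return a copy of buf with `value` written as `size` little-endian bytes."""
    out = bytearray(buf)
    out[index:index + size] = value.to_bytes(size, "little")
    return bytes(out)


as_byte = store_copy(original, 4, ord("."), 1)   # like: mov byte ptr [msg+4], '.'
as_word = store_copy(original, 4, ord("."), 2)   # like: mov word ptr [msg+4], '.'

print(as_byte.hex(" ").upper())   # byte 5 keeps its old value 0x69 ('i')
print(as_word.hex(" ").upper())   # byte 5 becomes 0x00, the high half of 0x002E
```

    In the program from the question the difference ends up invisible, because the next instruction overwrites msg+5 with '$' in either case.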

    In future, if you are not sure how a particular thing assembles and what it will do to the CPU/memory state, just use the emu8086 debugger. Had you done so in this case, you would have seen in the disassembly window that all those variants of msg+x assembled to memory addresses going byte by byte over the original msg memory. Also, if you open a memory view (I hope emu8086 has one, I don't use it) at the msg address, you can watch how each write changes the original values and how that WORD PTR works, since you were not sure. Watching in a debugger is usually a lot easier than reading these long answers on Stack Overflow...

    About what PTR does: the question "In assembly, what does `PTR` stand for?" doesn't explain it well, as it's hard to explain well. The whole "BYTE PTR" is a term used by MASM: it does not parse BYTE and then apply PTR to it, but parses "BYTE PTR" as a unit and concludes "okay, he wants to address a byte". It's like a single keyword, just with a space in it.