I've been reading past posts on what ptr does and what [] does in assembly, but I'm still struggling to understand the following code:
Title : Program failed to comprehend
.model small
.stack 100h
.data
Msg db 10,13, 'Zahid. $'
.code
.startup
; Initialising data segment
mov ax, @data
mov ds, ax
;Before operation displaying message
mov dx, offset msg
mov ah,09h
int 21h
mov msg , 'A' ; Writing on memory specified by msg thats OK -> A
mov msg+1 , 'R' ; Confusion as writing on memory specified by msg then add 1(not 8bits next address write) -> A
mov [msg]+2, 'I' ; confusion: Writing on memory specified by msg then add 2 value to that address write-> I
mov byte ptr [msg+3] , 'F' ; Ok with me, writing on byte memory specified by msg+3 -> F
mov word ptr [msg + 4], '.' ; Confused again as writing on word memory specified by msg+4, '.' Will it write ascii on the two bytes-> .
mov msg[5] , '$' ; Not confused on this.
;Print var
mov dx, offset msg
mov ah,09h
int 21h
;Exit Code.
mov ah,04ch
xor al,al
int 21h
Ends
Output:
Zahid. ARIF.
Please explain the operation as I believe it should not print 'ARIF'??
In assembly, the syntax depends on the particular assembler. Emu8086 mostly follows the MASM dialect, which has quite relaxed rules and allows several different spellings of the same instruction (all producing the same output).
If you are used to a high-level programming language, this may feel confusing: why isn't the syntax set in stone, and how do you live with this mess in asm?
But for an asm programmer this is rarely an issue, because in assembly you don't build a runtime expression out of operators and values. Each instruction in the source is usually mapped 1:1 to one CPU instruction, with exactly the arguments and options that the instruction has in the CPU.
The MOV instruction on x86 is a bit of a mess itself, as the single mnemonic "MOV" covers many different instruction opcodes, but in your example only two instructions are used: MOV r/m8,imm8 with opcode C6 for storing byte values, and MOV r/m16,imm16 with opcode C7 for storing a word value. In all cases the r/m part is a memory reference by absolute offset, which is calculated at assembly time.
So if msg is the symbol for memory address 0x1000, then those lines in your question compile as:
; machine code | disassembled instruction from machine code
C606001041 mov byte [0x1000],0x41
Store byte value 0x41 ('A') into memory at address ds:0x1000. The C6 06 pair is the opcode C6 plus the ModR/M byte 06 that selects the MOV [offset16],imm8 form, the 00 10 bytes are the 0x1000 offset16 itself (little endian), and finally the 41 is the imm8 value 0x41. Segment ds is used by default to calculate the full physical memory address, because there's no segment override prefix ahead of the instruction.
C606011052 mov byte [0x1001],0x52
C606021049 mov byte [0x1002],0x49
C606031046 mov byte [0x1003],0x46
C70604102E00 mov word [0x1004],0x2e
C606051024 mov byte [0x1005],0x24
The remaining lines are the same story: they write byte values at specific memory offsets, going byte by byte through memory and overwriting each one.
The subtle difference is mov word ptr [msg + 4], '.', which targets memory address ds:0x1004 in the same way as the other lines, but the value stored is an imm16, i.e. a word value equal to 0x002E ('.'), so the different opcode C7 is used and the immediate value needs two bytes, 2E 00. This one overwrites memory at address ds:0x1004 with byte 2E, and ds:0x1005 with byte 00.
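The byte patterns above can be cross-checked with a small script. This is only a sketch in Python (nothing emu8086 provides), and the helper name encode_mov_mem_imm is made up for illustration. It assumes the direct-address form with ModR/M byte 06, no segment override prefix, and the 0x1000 address used in this answer.

```python
def encode_mov_mem_imm(offset: int, value: int, size: int) -> bytes:
    """Encode MOV [offset16], imm8/imm16 as shown in the listing above.

    Hypothetical helper for illustration only; it covers just the
    direct-address form (ModR/M byte 06) with no segment override prefix.
    """
    if size not in (1, 2):
        raise ValueError("size must be 1 (byte) or 2 (word)")
    opcode = 0xC6 if size == 1 else 0xC7       # C6 = imm8 store, C7 = imm16 store
    modrm = 0x06                               # mod=00, r/m=110: 16-bit direct address
    return (bytes([opcode, modrm])
            + offset.to_bytes(2, "little")     # offset16, low byte first
            + value.to_bytes(size, "little"))  # immediate, zero-extended to `size` bytes


MSG = 0x1000  # assumed address of msg, as in the examples above

print(encode_mov_mem_imm(MSG, ord("A"), 1).hex().upper())      # C606001041
print(encode_mov_mem_imm(MSG + 4, ord("."), 2).hex().upper())  # C70604102E00
```

The only things that differ between the byte and word stores are the opcode and the width of the immediate; the address part is encoded identically in both.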
So if the memory at address msg (ds:0x1000 in my examples) was at the beginning:
0x1000: 0A 0D 5A 61 68 69 64 2E 20 24 ; "\n\rZahid. $"
It will change to this after each MOV is executed:
0x1000: 41 0D 5A 61 68 69 64 2E 20 24 ; "A\rZahid. $"
0x1000: 41 52 5A 61 68 69 64 2E 20 24 ; "ARZahid. $"
0x1000: 41 52 49 61 68 69 64 2E 20 24 ; "ARIahid. $"
0x1000: 41 52 49 46 68 69 64 2E 20 24 ; "ARIFhid. $"
0x1000: 41 52 49 46 2E 00 64 2E 20 24 ; "ARIF.\0d. $"
That word write overwrote two bytes: 'h' (with the dot) and 'i' (with zero).
0x1000: 41 52 49 46 2E 24 64 2E 20 24 ; "ARIF.$d. $"
And that zero is overwritten one more time with a dollar sign (the string terminator for the DOS int 21h service ah=9).
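The same trace can be replayed outside the emulator. The following Python sketch is an illustration only: the names store and dos_print are hypothetical, memory is modelled as a plain bytearray starting at msg, and dos_print merely imitates how int 21h / ah=9 prints up to (not including) the first '$'.

```python
def store(mem: bytearray, offset: int, value: int, size: int) -> None:
    """Write `value` as `size` little-endian bytes at `offset` (1 = byte, 2 = word)."""
    mem[offset:offset + size] = value.to_bytes(size, "little")


def dos_print(mem: bytearray, offset: int = 0) -> str:
    """Imitate int 21h / ah=9: return the text up to, not including, the first '$'."""
    end = mem.index(b"$", offset)
    return mem[offset:end].decode("ascii")


# Msg db 10,13, 'Zahid. $'
msg = bytearray(b"\n\rZahid. $")

first = dos_print(msg)        # printed before any of the MOVs run

store(msg, 0, ord("A"), 1)    # mov msg, 'A'
store(msg, 1, ord("R"), 1)    # mov msg+1, 'R'
store(msg, 2, ord("I"), 1)    # mov [msg]+2, 'I'
store(msg, 3, ord("F"), 1)    # mov byte ptr [msg+3], 'F'
store(msg, 4, ord("."), 2)    # mov word ptr [msg+4], '.'  -> bytes 2E 00
store(msg, 5, ord("$"), 1)    # mov msg[5], '$'            -> replaces the 00

second = dos_print(msg)       # printed after the MOVs

print(" ".join(f"{b:02X}" for b in msg))  # 41 52 49 46 2E 24 64 2E 20 24
print(repr(first + second))               # '\n\rZahid. ARIF.'
```

Put together, the two prints give "Zahid. ARIF.", which matches the output shown in the question: the second print stops at the '$' written to msg+5, so the leftover "d. $" bytes are never shown.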
Generally the relaxed syntax is not a problem, because you can't build your own instruction; the assembler will guess which one of the existing instructions fits, and compile whatever expression you have into it. There's no instruction on x86 like mov [address1] and [address2], value storing the same value at two different memory locations, or mov [address]+2 which would add two to the memory value at address (that's possible to do with add [address], 2, which is one of the add r/m,imm variants, depending on the data size).
So mov msg+1,... can only mean the memory address msg + 1; there's no other meaningful possibility in the x86 instruction set. And the data size byte is deduced from the db directive used after the label msg. This is a speciality of the MASM and emu8086 assemblers; most other assemblers don't link a defined label (symbol) with the directive used after it, i.e. there are no "types" of symbols in common assemblers. For those, mov msg+1,'R' may end with a syntax error, not because the left side is problematic, but because they will not know how big the 'R' value should be (how many bytes).
My personal favourite, NASM, would report another error on it, as it requires brackets around memory accesses, so in NASM only mov [msg+2],... would be valid (with a size modifier like MASM's "byte ptr" allowed, but without the "ptr": mov byte [msg+2],...). But in MASM/emu8086 all the variants you used are valid syntax with the same meaning, producing a memory reference by 16-bit offset.
The assembler will also not produce two instructions instead of a single one (the exception may be special "pseudo-instructions" in some assemblers, which are compiled to several native instructions, but that is not common in x86 assembly).
Once you know the target CPU's instruction set, i.e. which instructions exist, you will easily be able to guess from the vague syntax which target instruction will be produced.
Or you can easily check in the debugger's disassembly window, as the disassembler uses only a single syntax for each particular instruction and is not aware of the source formatting ambiguities.
mov word ptr [msg + 4], '.' ; Confused again as writing on word memory specified by msg+4, '.' Will it write ascii on the two bytes-> .
It will write two bytes; that's what WORD PTR in MASM specifies. The value is only '.' = 0x2E, but 0x2E is perfectly valid as a 16-bit value too, simply extended with zeroes to 0x002E, and that's the value used by the assembler for this line.
In future, if you are not sure how a particular thing assembles and what it will do to the CPU/memory state, just use the emu8086 debugger. If you had done so in this case, you would have seen in the disassembly window that all those variants of msg+x compiled to memory addresses going byte by byte over the original msg memory. Also, if you open a memory view (I hope emu8086 has one, I don't use it) at the msg address, you can watch each write to memory, how it changes the original values, and how that WORD PTR works, since you were not sure. Watching in a debugger is usually a lot easier than reading these long answers on Stack Overflow...
About what PTR does: In assembly, what does `PTR` stand for? ... doesn't explain it well, as it's hard to explain well. The whole "BYTE PTR" is a term used by MASM; it doesn't parse BYTE and then do something PTR to the BYTE, but parses "BYTE PTR" as a unit and goes "okay, they want to address a byte". It's like a single keyword, but with a space.