I have learned about compilers and assembly language, so I'd like to write my own assembler as an exercise, but I have some questions:
How can I compute the address for a segment such as @DATA, or for an expression like OFFSET/ADDR VarA?
Take an easy assembly program as an example:
.model small
.stack 1024
.data
msg db 128 dup('A')
.code
start:
mov ax,@data
mov ds,ax
mov dx, offset msg
; DS:DX points at msg
mov ah,4ch
int 21h ; exit program without using msg
end
So how does the assembler calculate the segment address for the @data segment?
And how does it know what to put into the immediate for mov dx, offset msg?
The assembler doesn't know where @data and msg will end up in memory, so it generates metadata called relocations (or "fixups") in the object (.OBJ) file that allow the linker and the operating system to fill in the correct values.
Let's take a look at what happens with a slightly different example program:
.model small
.stack 1024
.data
msg db 'Hello, World!','$'
.code
start:
mov ax,SEG msg
mov ds,ax
mov dx,OFFSET msg
mov ah,09h
int 21h ; write string in DS:DX to stdout
mov ah,4ch
int 21h ; exit(AL)
end start
When assembling this file, the assembler has no way of knowing where the linker will put anything defined by this example program. It may appear obvious to you, but the assembler can't assume it's seeing a complete program. The assembler doesn't know whether you'll link it with other object files or libraries, which could cause the linker to put msg somewhere other than the start of the data segment.
So when this example program gets assembled into an object file, the assembler generates two relocation records. If you use MASM to assemble the file, you can see this in the listing file generated with the /Fl switch:
; listing of the .obj assembler output, before linking
0000 start:
0000 B8 ---- R mov ax,SEG msg
0003 8E D8 mov ds,ax
0005 BA 0000 R mov dx,OFFSET msg
0008 B4 09 mov ah,09h
The R next to the operands in the machine-code column of the listing indicates that they have relocations that refer to them. When the linker creates the MS-DOS format executable from the object file, it will be able to supply the correct offset of msg from the start of the data segment. That value is a link-time constant, so only the .obj, not the .exe, needs a relocation for it.
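As a rough illustration of the split between assembler and linker, here is a minimal Python sketch. It is not how MASM or any real linker is implemented: the fixup tuples, the function names assemble_example and link_offsets, and the in-memory "object file" are all hypothetical, and real object formats store considerably more information per fixup. The opcode bytes match the listing above.

```python
import struct

# Hypothetical in-memory "object file": code bytes plus a list of fixups.
# Each fixup is (position of the 16-bit operand, kind, symbol name).
def assemble_example():
    code = bytearray()
    fixups = []

    code += b"\xB8"                          # mov ax, imm16
    fixups.append((len(code), "SEG", "msg"))
    code += b"\x00\x00"                      # placeholder for SEG msg

    code += b"\x8E\xD8"                      # mov ds, ax

    code += b"\xBA"                          # mov dx, imm16
    fixups.append((len(code), "OFFSET", "msg"))
    code += b"\x00\x00"                      # placeholder for OFFSET msg

    code += b"\xB4\x09"                      # mov ah, 09h
    return code, fixups


def link_offsets(code, fixups, symbol_offsets):
    """Patch OFFSET fixups in place; return the fixups left unresolved.

    symbol_offsets maps a symbol name to its final offset within its
    segment, which only the linker knows once all modules are combined.
    """
    remaining = []
    for pos, kind, name in fixups:
        if kind == "OFFSET":
            struct.pack_into("<H", code, pos, symbol_offsets[name])
        else:
            # SEG values depend on the load address, so they cannot be
            # fully resolved at link time in this sketch.
            remaining.append((pos, kind, name))
    return remaining


code, fixups = assemble_example()
# Assume another module's data pushed msg to offset 0010h.
remaining = link_offsets(code, fixups, {"msg": 0x0010})
```

After the call, the operand of mov dx is patched to 10 00 (little-endian 0010h), while the SEG fixup at position 1 is still pending and has to be carried into the executable.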
However, the linker won't be able to supply the segment of msg (the data segment), because the linker doesn't know where MS-DOS will load the executable into memory. (Unlike a modern mainstream OS, where every process has its own virtual address space, real mode has only one address space, which programs have to share with device drivers, TSRs, and the OS itself.)
So the linker will put a relocation in the generated executable that tells MS-DOS to adjust the immediate operand based on where the program gets loaded.
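What the loader does with such a relocation can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the function name apply_segment_relocations is hypothetical, the image is just the load module as a bytearray, and each relocation entry is an (offset, segment) pair locating a 16-bit word relative to the start of the load module, as in the .EXE relocation table. The loader adds the segment at which the program was loaded to each such word.

```python
import struct

def apply_segment_relocations(image, relocations, load_segment):
    """Add load_segment to every 16-bit word named by a relocation entry.

    image        -- bytearray holding the load module (hypothetical model)
    relocations  -- list of (offset, segment) pairs, relative to the image
    load_segment -- segment chosen by the loader at run time
    """
    for offset, segment in relocations:
        pos = segment * 16 + offset          # linear position in the image
        (value,) = struct.unpack_from("<H", image, pos)
        struct.pack_into("<H", image, pos, (value + load_segment) & 0xFFFF)
    return image


# mov ax, 0001h / mov ds, ax: the linker stored the data segment as
# paragraph 0001h relative to the start of the load module (assumed value).
image = bytearray(b"\xB8\x01\x00\x8E\xD8")
apply_segment_relocations(image, [(1, 0)], load_segment=0x1234)
```

With an assumed load segment of 1234h, the operand of mov ax becomes 1235h, so the bytes at positions 1 and 2 end up as 35 12.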
Note that you might want to simplify your assembler-writing exercise by writing one that only works with complete programs and generates only .COM executables. That way you don't have to worry about relocations: your assembler decides where everything gets placed within the single segment allowed by the .COM format. Because .COM files don't support segment relocations, instructions like mov ax,@data or mov ax,SEG msg can't be used. Instead, CS=DS=ES=SS on program startup, with a value chosen by the OS's program loader. (And that value isn't known at assemble time.)
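To show why the .COM route needs no relocations, here is a minimal Python sketch that "assembles" a .COM version of the hello-world program by hand. The function name assemble_com is hypothetical, and the instruction sizes are hard-coded rather than computed by a real first pass. The key point is that DOS loads a .COM image at offset 100h within its segment, so every offset is known at assemble time.

```python
import struct

COM_ORIGIN = 0x100  # a .COM image is loaded at offset 100h in its segment

def assemble_com():
    msg = b"Hello, World!$"

    # "Pass 1" (hard-coded in this sketch): total size of the code, so we
    # know where msg lands when it is placed right after the code.
    # mov dx,imm16 (3) + mov ah,09h (2) + int 21h (2)
    # + mov ah,4Ch (2) + int 21h (2)
    code_size = 3 + 2 + 2 + 2 + 2
    msg_offset = COM_ORIGIN + code_size

    # "Pass 2": emit the bytes with the final offset filled in directly.
    code = b"\xBA" + struct.pack("<H", msg_offset)   # mov dx, OFFSET msg
    code += b"\xB4\x09\xCD\x21"                      # mov ah,09h / int 21h
    code += b"\xB4\x4C\xCD\x21"                      # mov ah,4Ch / int 21h
    assert len(code) == code_size
    return code + msg


com_image = assemble_com()
```

Here msg ends up at offset 010Bh, so the first instruction is encoded as BA 0B 01 with no fixup of any kind. DS already equals CS at startup, so the program does not need to load a segment register before calling int 21h.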