assembly compiler-construction masm x86-16 memory-segmentation

How does an assembler compute the segment and offset for symbol addresses?

I have learned about compilers and assembly language, so I'd like to write my own assembler as an exercise. But I have a few questions.

How can I compute the values for expressions such as @DATA, or OFFSET/ADDR VarA?

Take an easy assembly program as an example:

    .model small
    .stack 1024
    .data
          msg db 128 dup('A')
    .code
    start:
        mov ax,@data
        mov ds,ax
        mov dx, offset msg
                           ; DS:DX points at msg
        mov ah,4ch
        int 21h            ; exit program without using msg
    end

So how does the assembler calculate the segment address for the @data segment?

And how does it know what to put into the immediate for mov dx, offset msg?

Solution

The assembler doesn't know where @data and msg will end up in memory, so it generates metadata called relocations (or "fixups") in the object (.OBJ) file that allow the linker and operating system to fill in the correct values.

Let's take a look at what happens with a slightly different example program:

    .model small
    .stack 1024
    .data
        msg db 'Hello, World!','$'
    .code
    start:
        mov ax,SEG msg
        mov ds,ax
        mov dx,OFFSET msg
        mov ah,09h
        int 21h              ; write string in DS:DX to stdout
        mov ah,4ch
        int 21h              ; exit(AL)
    end start

When assembling this file, the assembler has no way of knowing where the linker will put anything defined by this example program. It may appear obvious to you, but the assembler can't assume it is seeing a complete program. The assembler doesn't know if you'll link it with other object files or libraries, which could cause the linker to put msg somewhere other than the start of the data segment.

So when this example program gets assembled into an object file, the assembler generates two relocation records. If you use MASM to assemble the file, you can see this in the listing file generated with the /Fl switch:
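
For someone writing their own assembler, the idea can be sketched roughly as follows. This is a minimal illustration in Python, not how MASM is implemented: the `Fixup` class, the `"SEG"`/`"OFFSET"` kind names, and the helper function are all invented for this sketch. It encodes the first four instructions of the example, leaves a zero placeholder wherever an operand is not yet known, and records where each placeholder sits.

```python
from dataclasses import dataclass

@dataclass
class Fixup:
    offset: int   # position of the 16-bit operand within the code bytes
    kind: str     # "SEG" or "OFFSET" (names made up for this sketch)
    symbol: str   # symbol whose final location is not known yet

def assemble_example():
    code = bytearray()
    fixups = []

    def mov_r16_symbol(opcode, kind, symbol):
        code.append(opcode)                           # B8+r: mov r16, imm16
        fixups.append(Fixup(len(code), kind, symbol)) # operand starts here
        code.extend(b"\x00\x00")                      # placeholder operand

    mov_r16_symbol(0xB8, "SEG", "msg")      # mov ax, SEG msg
    code.extend(b"\x8E\xD8")                # mov ds, ax
    mov_r16_symbol(0xBA, "OFFSET", "msg")   # mov dx, OFFSET msg
    code.extend(b"\xB4\x09")                # mov ah, 09h
    return bytes(code), fixups

code, fixups = assemble_example()
print(code.hex(" "))
for f in fixups:
    print(f)
```

A real object file format (OMF for MASM) stores this information in its own record layout; the sketch only shows which facts the assembler has to hand over to the linker.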

 ; listing of the .obj assembler output, before linking
 0000               start:
 0000  B8 ---- R            mov ax,SEG msg
 0003  8E D8                mov ds,ax
 0005  BA 0000 R            mov dx,OFFSET msg
 0008  B4 09                mov ah,09h

The R next to the operands in the machine code column of the listing indicates that relocations refer to them. When the linker creates the MS-DOS format executable from the object file, it will be able to supply the correct offset of msg from the start of the data segment. That value is a link-time constant, so only the .obj, not the .exe, needs a relocation for it.
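
The linker's side of this can be sketched like so. Everything here is hypothetical: the tuple layout of the fixups, the `link` function, and the final placement values (msg at offset 0020h because some other module's data is assumed to come first, and the data segment one paragraph after the start of the image) are assumptions chosen for illustration. The OFFSET fixup is fully resolved, while the SEG fixup is filled with a segment value relative to the start of the image and remembered for the executable's relocation table.

```python
import struct

def link(code, fixups, symbol_offsets, symbol_segments):
    """Patch OFFSET fixups; turn SEG fixups into EXE relocation entries."""
    image = bytearray(code)
    exe_relocs = []
    for offset, kind, symbol in fixups:
        if kind == "OFFSET":
            # Link-time constant: nothing more to do after patching.
            struct.pack_into("<H", image, offset, symbol_offsets[symbol])
        elif kind == "SEG":
            # Segment relative to the start of the image; the loader
            # still has to add the actual load segment at run time.
            struct.pack_into("<H", image, offset, symbol_segments[symbol])
            # (offset, segment) of the word to adjust; the code segment
            # is assumed to sit at paragraph 0 of the image.
            exe_relocs.append((offset, 0))
    return bytes(image), exe_relocs

# Code bytes of the example with zero placeholders (16 bytes in total).
code = bytes.fromhex("b800008ed8ba0000b409cd21b44ccd21")
fixups = [(1, "SEG", "msg"), (6, "OFFSET", "msg")]

image, exe_relocs = link(code, fixups, {"msg": 0x0020}, {"msg": 0x0001})
print(image.hex(" "))
print(exe_relocs)
```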

However, the linker won't be able to supply the location of the segment of msg (the data segment), because the linker doesn't know where MS-DOS will load the executable into memory. (Unlike under a modern mainstream OS, where every process has its own virtual address space, real mode has only one address space, which programs have to share with device drivers, TSRs, and the OS itself.)

So the linker will put a relocation in the generated executable that tells MS-DOS to adjust the immediate operand based on where the program gets loaded.
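
What the loader does with such a relocation can be sketched as below. In the MZ executable format each relocation table entry is an offset and segment pair, relative to the start of the loaded image, that locates a 16-bit word, and the loader adds the segment at which the image was actually loaded to that word. The function name, the sample image bytes, and the load segment 1234h are assumptions made up for this illustration.

```python
import struct

def apply_relocations(image, relocs, start_segment):
    """Add the load segment to every word named by a relocation entry."""
    loaded = bytearray(image)
    for reloc_offset, reloc_segment in relocs:
        pos = reloc_segment * 16 + reloc_offset       # position within the image
        (value,) = struct.unpack_from("<H", loaded, pos)
        struct.pack_into("<H", loaded, pos, (value + start_segment) & 0xFFFF)
    return bytes(loaded)

# Image as a linker might leave it: "mov ax, 0001h" holds a segment value
# relative to the start of the image (hypothetical bytes for this sketch).
image = bytes.fromhex("b801008ed8ba2000b409cd21b44ccd21")
relocs = [(0x0001, 0x0000)]    # the word at 0000:0001 needs adjusting

loaded = apply_relocations(image, relocs, start_segment=0x1234)
print(loaded[:3].hex(" "))     # operand is now 0001h + 1234h
```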

Note that you might want to simplify your assembler-writing exercise by writing one that only works with complete programs and generates only .COM executables. That way you don't have to worry about relocations: your assembler decides where everything gets placed within the single segment allowed by the .COM format. Because .COM files don't support segment relocations, instructions like mov ax,@data or mov ax,SEG msg can't be used. Instead, CS=DS=ES=SS on program startup, with a value chosen by the OS's program loader. (That value isn't known at assemble time.)
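
A .COM-only assembler along those lines could start from something like the following two-pass sketch. The item representation (`"label"`, `"bytes"`, `"mov_dx_offset"`) and the `assemble_com` function are invented for this illustration, and a real assembler would parse source text rather than take pre-encoded bytes. The point is that with a fixed origin of 100h, where DOS places a .COM image within its segment, every offset is known at assemble time and no fixups are needed.

```python
import struct

ORG = 0x100  # a .COM image is loaded at offset 100h of its single segment

def assemble_com(items):
    # Pass 1: walk the program once to assign an address to every label.
    labels, pc = {}, ORG
    for kind, arg in items:
        if kind == "label":
            labels[arg] = pc
        elif kind == "bytes":
            pc += len(arg)
        elif kind == "mov_dx_offset":
            pc += 3                      # opcode byte + 16-bit immediate
    # Pass 2: emit bytes, now that every label address is known.
    out = bytearray()
    for kind, arg in items:
        if kind == "bytes":
            out += arg
        elif kind == "mov_dx_offset":
            out += b"\xBA" + struct.pack("<H", labels[arg])
    return bytes(out)

program = [
    ("mov_dx_offset", "msg"),            # mov dx, OFFSET msg
    ("bytes", b"\xB4\x09\xCD\x21"),      # mov ah,09h / int 21h
    ("bytes", b"\xB4\x4C\xCD\x21"),      # mov ah,4ch / int 21h
    ("label", "msg"),
    ("bytes", b"Hello, World!$"),
]

com = assemble_com(program)
print(com.hex(" "))
```

Here msg follows 11 bytes of code, so its offset works out to 100h + 0Bh = 010Bh, and that constant goes straight into the mov dx instruction.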