assembly compiler-construction relative-addressing

PC-relative addressing on an assembly-like language compiler

I'm currently writing a compiler for a custom asm-like programming language and I'm really confused about how to do proper PC-relative addressing for data labels.

main    LDA RA hello
        IPT #32
        HLT

hello   .STR "Hello, world!"

The pseudo-code above, after compilation, results in the following hex:

31 80 F0 20 F0 0C 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 00

3180, F020 and F00C are the LDA, IPT and HLT instructions.

As seen in the code, the LDA instruction takes the label hello as an argument. When compiled, the label becomes the value 02, which means "Incremented PC + 0x02" (if you look at the code, that's the location of the "Hello, world!" line relative to the LDA instruction). The thing is: .STR is not an instruction; it only tells the compiler to add a (0-terminated) string at the end of the executable. So, were there other instructions after the hello label declaration, that offset would be wrong.

But I can't find a way to calculate the right offset, other than having the compiler travel through time. Do I have to "compile" it twice? First for the data labels, then for the actual instructions?

Solution

Yes, most assemblers are (at least) two-pass - precisely because of forward references like these. Adding macro capabilities can add more passes.

Look at an assembly listing, not just the op-codes. As you said the actual offset is "2", I'm assuming memory is word-addressed.

0000 3180   main    LDA RA hello
0001 F020           IPT #32
0002 F00C           HLT

0003 4865   hello   .STR "Hello, world!"

The first two columns are the PC and the opcode. I'm not sure how the LDA instruction has been encoded (where is the +2 offset in there?).
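To make the mapping between the hex dump and the listing concrete, here is a small sketch that regroups the bytes from the question into word addresses. It assumes 16-bit big-endian words and word addressing, which is a guess based on the "+2" offset; the function name words is made up for illustration.

```python
# Bytes copied from the hex dump in the question.
HEX = "31 80 F0 20 F0 0C 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 00"
image = bytes.fromhex(HEX)


def words(image):
    """Pair each 16-bit big-endian word with its word address (assumed layout)."""
    return [
        (i // 2, int.from_bytes(image[i:i + 2], "big"))
        for i in range(0, len(image), 2)
    ]


for address, word in words(image):
    print(f"{address:04X} {word:04X}")

# The string starts at word 3, i.e. byte 6, and runs up to the NUL byte.
print(image[6:].split(b"\x00")[0].decode("ascii"))
```

Under those assumptions the first four printed lines match the PC and opcode columns of the listing, and the string occupies words 0003 through 0009.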

In the first pass, assuming all addressing is relative, the assembler would emit the fixed part of the op-code (covering the LDA RA part) along with a marker showing that the instruction needs to be patched with the address of hello in the second pass.

At this point it knows the size, but not the complete value, of the final machine language.

It then continues on, working out the address of each instruction and building its symbol table.

In the second pass, now knowing the above information, it patches each instruction by calculating relative offsets etc. It also often regenerates the entire output (including PC values).
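The two passes described above can be sketched roughly as follows. This is only an illustration, not the encoding your machine uses: it assumes every instruction is one 16-bit word, that .STR data is packed two bytes per word, and that offsets are relative to the incremented PC. The tuple layout and the function names are made up, and the sketch resolves operands without building real op-codes.

```python
import math

# Hypothetical source representation: (label or None, mnemonic, operand).
PROGRAM = [
    ("main", "LDA", "hello"),
    (None, "IPT", "#32"),
    (None, "HLT", None),
    ("hello", ".STR", "Hello, world!"),
]

BYTES_PER_WORD = 2  # assumption: 16-bit words, word-addressed memory


def size_in_words(mnemonic, operand):
    """Size of one source line in words; instructions are assumed to be one word."""
    if mnemonic == ".STR":
        # String bytes plus the terminating NUL, rounded up to whole words.
        return math.ceil((len(operand) + 1) / BYTES_PER_WORD)
    return 1


def first_pass(program):
    """Assign an address to every line and record where each label lives."""
    symbols, addresses, pc = {}, [], 0
    for label, mnemonic, operand in program:
        if label is not None:
            symbols[label] = pc
        addresses.append(pc)
        pc += size_in_words(mnemonic, operand)
    return symbols, addresses


def second_pass(program, symbols, addresses):
    """Replace label operands with offsets relative to the incremented PC."""
    resolved = []
    for (label, mnemonic, operand), address in zip(program, addresses):
        if mnemonic != ".STR" and operand in symbols:
            operand = symbols[operand] - (address + 1)
        resolved.append((address, mnemonic, operand))
    return resolved


symbols, addresses = first_pass(PROGRAM)
listing = second_pass(PROGRAM, symbols, addresses)
print(symbols)     # {'main': 0, 'hello': 3}
print(listing[0])  # (0, 'LDA', 2)
```

Because the first pass only needs sizes, it can walk past hello before knowing how LDA will finally be encoded. Adding instructions between HLT and hello changes the symbol table, and the second pass then picks up the new offset with no further work.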

Occasionally, something will be detected in the second pass which prevents it from continuing. For example, perhaps you can only reference objects within a 256-word window (-128 through +127), but the label hello turns out to lie outside that window. This means the assembler should have used a two-word instruction (with an absolute address), which changes everything it learnt during the first pass.
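One common way to handle this is to redo the layout until nothing changes any more. The sketch below is a hypothetical illustration of that idea, assuming an 8-bit two's-complement offset field and a long form that costs exactly one extra word; the names layout and relax and the item format are invented for the example.

```python
def layout(items, sizes):
    """Return (label -> address, per-item addresses) for the current sizes."""
    symbols, addresses, pc = {}, [], 0
    for (label, _target), size in zip(items, sizes):
        if label is not None:
            symbols[label] = pc
        addresses.append(pc)
        pc += size
    return symbols, addresses


def relax(items, base_sizes, bits=8):
    """Grow out-of-range references to the long form until the layout is stable.

    items: list of (label or None, target label or None)
    base_sizes: size in words of each item in its short form
    """
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    sizes = list(base_sizes)
    is_long = [False] * len(items)
    changed = True
    while changed:
        changed = False
        symbols, addresses = layout(items, sizes)
        for i, (_label, target) in enumerate(items):
            if target is None or is_long[i]:
                continue
            offset = symbols[target] - (addresses[i] + 1)
            if not low <= offset <= high:
                is_long[i] = True
                sizes[i] += 1  # assumption: the long form is one extra word
                changed = True
    return sizes, layout(items, sizes)[0]


# A reference that jumps over a 200-word block does not fit in 8 bits.
items = [("main", "far"), (None, None), ("far", None)]
sizes, symbols = relax(items, [1, 200, 1])
print(sizes)           # [2, 200, 1]
print(symbols["far"])  # 202
```

Items only ever grow, so the loop terminates. Note how widening the first instruction moves far from word 201 to word 202, which is exactly the "changes everything" effect.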

This is often referred to as a 'fixup' error. The same thing can happen during the link phase.

Single-pass assemblers are only possible if you insist on 'define before use', in which case the assembler would report hello as an undefined symbol in your code.
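As a rough illustration of 'define before use', here is a hypothetical single-sweep resolver. It rejects your program as written and accepts a reordered version where the string comes first. The tuple format (label, mnemonic, target label, size in words) and all names are assumptions made for the example.

```python
class UndefinedSymbol(Exception):
    """Raised when a label is used before it has been defined."""


def assemble_single_pass(program):
    """Resolve label operands in one sweep; forward references fail.

    program: list of (label or None, mnemonic, target label or None, size in words)
    """
    symbols, resolved, pc = {}, [], 0
    for label, mnemonic, target, size in program:
        if label is not None:
            symbols[label] = pc
        offset = None
        if target is not None:
            if target not in symbols:
                raise UndefinedSymbol(f"{target} used at word {pc} before its definition")
            # Offset relative to the incremented PC, as in the question.
            offset = symbols[target] - (pc + 1)
        resolved.append((pc, mnemonic, offset))
        pc += size
    return resolved


code_first = [("main", "LDA", "hello", 1), (None, "IPT", None, 1),
              (None, "HLT", None, 1), ("hello", ".STR", None, 7)]
data_first = [("hello", ".STR", None, 7), ("main", "LDA", "hello", 1),
              (None, "IPT", None, 1), (None, "HLT", None, 1)]

try:
    assemble_single_pass(code_first)
except UndefinedSymbol as err:
    print("error:", err)

print(assemble_single_pass(data_first)[1])  # (7, 'LDA', -8)
```

With the data placed first the offset simply comes out negative, but the program then no longer starts at word 0, which is one reason two passes are usually the more comfortable choice.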

You also need to read up on "program sections". Whilst .STR is not an executable instruction, it is a directive to the assembler to place the binary representation of the string into the CODE section of the image (vs DATA).