I'm currently writing a compiler for a custom asm-like programming language and I'm really confused on how to do proper PC-relative addressing for data labels.
main LDA RA hello
IPT #32
HLT
hello .STR "Hello, world!"
The pseudo-code above, after compilation, results in the following hex:
31 80 F0 20 F0 0C 48 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 00
3180, F020 and F00C are the LDA, IPT and HLT instructions.
As seen in the code, the LDA instruction uses the label hello as an argument, which, when compiled, becomes the value 02, meaning "Incremented PC + 0x02" (if you look at the code, that's the location of the "Hello, world!" line relative to the LDA call).
The thing is: .STR is not an instruction. It only tells the compiler to add a (0-terminated) string at the end of the executable, so if there were other instructions after the hello label declaration, that offset would be wrong.
But I can't find a way to calculate the right offset, short of the compiler being able to travel through time. Do I have to "compile" it two times? First for the data labels, then for the actual instructions?
Yes, most assemblers are (at least) two-pass - precisely because of forward references like these. Adding macro capabilities can add more passes.
Look at an assembly listing, not just the op-codes. As you said the actual offset is "2", I'm assuming memory is word-addressed.
0000 3180 main LDA RA hello
0001 F020 IPT #32
0002 F00C HLT
0003 4865 hello .STR "Hello, world!"
The first two columns are the PC and opcode. I'm not sure how the LDA instruction has been encoded (where is the +2 offset in there?).
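The word-addressing assumption can be checked with a little arithmetic on the addresses from the listing. This is only a sketch: the function name pc_relative_offset is made up for illustration, and the byte-addressed figures assume two bytes per word.

```python
# Addresses taken from the listing above (word-addressed).
LDA_ADDR = 0x0000    # address of the LDA instruction
HELLO_ADDR = 0x0003  # address of the hello label
INSTR_WORDS = 1      # each instruction in the listing occupies one word


def pc_relative_offset(instr_addr, target_addr, instr_size):
    """Offset from the incremented PC (the address just after the instruction)."""
    return target_addr - (instr_addr + instr_size)


# Counting in words: hello is at word 3, the incremented PC is word 1.
word_offset = pc_relative_offset(LDA_ADDR, HELLO_ADDR, INSTR_WORDS)

# Counting in bytes (assumed 2 bytes per word): hello is at byte 6,
# the incremented PC is byte 2.
byte_offset = pc_relative_offset(0, 6, 2)

print(word_offset, byte_offset)
```

Only the word-addressed calculation gives the 02 mentioned in the question; counting in bytes would give 4.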
In the first pass, assuming all addressing is relative, the assembler would emit the fixed part of the op-code (covering the LDA RA part) along with a marker to show that the instruction needs to be patched with the address of hello in the second pass.
At this point it knows the size, but not the complete value, of the final machine language.
It then continues on, working out the address of each instruction and building its symbol table.
In the second pass, now knowing the above information, it patches each instruction by calculating relative offsets etc. It also often regenerates the entire output (including PC values).
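The two passes described above can be sketched roughly as follows. This is a toy model, not the asker's actual instruction encoding: it assumes every instruction is one word, that .STR packs two characters per word plus a zero terminator, and it leaves the register operand out. All function names are hypothetical.

```python
def size_in_words(mnemonic, operand):
    """Size of one source line in words (toy sizing rule, an assumption)."""
    if mnemonic == ".STR":
        # Characters plus the zero terminator, two bytes per word, rounded up.
        return (len(operand) + 1 + 1) // 2
    return 1  # every instruction is assumed to fit in one word


def first_pass(program):
    """Assign an address to every line and build the symbol table."""
    symbols, addresses, pc = {}, [], 0
    for label, mnemonic, operand in program:
        if label:
            symbols[label] = pc
        addresses.append(pc)
        pc += size_in_words(mnemonic, operand)
    return symbols, addresses


def second_pass(program, symbols, addresses):
    """Patch label operands with offsets from the incremented PC."""
    resolved = []
    for (label, mnemonic, operand), addr in zip(program, addresses):
        if mnemonic != ".STR" and operand in symbols:
            next_pc = addr + size_in_words(mnemonic, operand)
            operand = symbols[operand] - next_pc
        resolved.append((addr, mnemonic, operand))
    return resolved


# (label, mnemonic, operand) triples for the program in the question.
program = [
    ("main", "LDA", "hello"),
    (None, "IPT", "#32"),
    (None, "HLT", None),
    ("hello", ".STR", "Hello, world!"),
]
symbols, addresses = first_pass(program)
listing = second_pass(program, symbols, addresses)
print(symbols)
print(listing[0])
```

Because the offset comes from the symbol table rather than from source order, adding instructions between LDA and hello changes the patched value without any special handling.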
Occasionally, something will be detected in the second pass which prevents it from continuing. For example, perhaps you can only reference objects within 256 words (-127 thru +128), but the label hello turns out to be more than 128 words away. This means it should have used a two-word instruction (with an absolute address), which changes everything it learnt during the first pass.
This is often referred to as a 'fix up' error. The same thing can happen during the link phase.
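A range check of this kind might look like the sketch below. The limits are simply the ones from the example in the text, and FixupError, check_offset and fits are hypothetical names; a real instruction set defines its own field width.

```python
class FixupError(Exception):
    """Raised when a resolved offset does not fit the instruction's offset field."""


# Range taken from the example in the text, not from any real instruction set.
MIN_OFFSET, MAX_OFFSET = -127, 128


def check_offset(label, offset):
    """Return the offset unchanged if it fits the short form, else raise FixupError."""
    if not MIN_OFFSET <= offset <= MAX_OFFSET:
        raise FixupError(f"{label}: offset {offset} does not fit the short form")
    return offset


def fits(label, offset):
    """Convenience wrapper: True if the short form can be used for this offset."""
    try:
        check_offset(label, offset)
        return True
    except FixupError:
        return False


print(fits("hello", 2), fits("hello", 200))
```

An assembler that hits this error either reports it or switches the instruction to the longer form and repeats the first pass, since every later address has moved.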
Single-pass assemblers are only possible if you insist on 'define before use', in which case your code would report hello as an undefined symbol.
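For contrast, here is a minimal single-pass sketch that resolves labels as it reads them. It assumes each line carries its own size in words and that any identifier-like operand is a label; UndefinedSymbol and assemble_single_pass are made-up names.

```python
class UndefinedSymbol(Exception):
    """Raised when a label is used before it has been defined."""


def assemble_single_pass(program):
    """Resolve labels while reading; only labels defined earlier are known."""
    symbols, resolved, pc = {}, [], 0
    for label, mnemonic, operand, size in program:
        if label:
            symbols[label] = pc
        is_label_ref = (
            mnemonic != ".STR"
            and isinstance(operand, str)
            and operand.isidentifier()
        )
        if is_label_ref:
            if operand not in symbols:
                raise UndefinedSymbol(operand)
            # Offset from the incremented PC, as in the question.
            operand = symbols[operand] - (pc + size)
        resolved.append((pc, mnemonic, operand))
        pc += size
    return resolved


# (label, mnemonic, operand, size in words); sizes are illustrative assumptions.
forward = [
    ("main", "LDA", "hello", 1),
    (None, "HLT", None, 1),
    ("hello", ".STR", "Hello, world!", 7),
]
backward = [forward[2], forward[0], forward[1]]  # string defined before its use

print(assemble_single_pass(backward)[1])
try:
    assemble_single_pass(forward)
except UndefinedSymbol as err:
    print("undefined symbol:", err)
```

With the string defined first, the reference resolves to a negative offset; with the original ordering, the single pass fails on hello exactly as described.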
You also need to read up on "program sections". Whilst .STR is not an executable instruction, it is a directive to the assembler to place the binary representation of the string into the CODE section of the image (vs DATA).