Search code examples
compilationcompiler-constructioncode-generation

How Does Assembly Code Generation Work?


I've been studying compiler design a lot recently. I've managed to get a strong grasp of the parsing stage, but am having a bit of trouble understanding how code generation works.

From what I've read, there seems to be 3 major steps in the code generation phase:

  • Instruction Selection (Greedy Tiling)
  • Instruction Scheduling
  • Register Allocation

Now, instruction scheduling is a little beyond what I'm trying to do at the moment, and I think with a bit more studying and prototyping, I can probably wrap my mind around the graph coloring algorithm for register allocation.

What stumps me is the first step, instruction selection. From what I've read about it, each instruction in a target machine language is represented by a tile; and the goal is to find the instructions that match the largest parts of the tree (hence the nickname, greedy tiling).

The thing I'm confused about is, how do you select instructions when they don't actually correspond 1:1 with the syntax tree?

Take for example, accumulator-based architectures like the Z80 or MIPs single instruction architecture. Performing even 16-bit integer arithmetic on a Z80 may require the use of the accumulator or shadow registers.

There are also some instructions that can only be used on certain registers despite them being general purpose.

Would I be right in assuming the following?

a) A tile may consist of a sequence of instructions that match a syntax tree pattern, rather than just a 1:1 match.

b) The code generator generates code for a stack-based architecture (or an architecture with infinite temporary registers) first and expands and substitutes instructions as necessary somehow during the register allocation phase.


Solution

  • a) A tile can emit arbitrary number of instructions. For example, if you have instruction like %x <- %y + %z, but the target machine has only two-address instructions, then a matching tile might emit the assembly sequence (destination is first operand)

    mov %x, %y
    add %x, %z
    

    b) what kind of register (or const, or mem reference) is allowed as an operand to an instruction is determined by the instruction itself, hence the instruction selection phase has to work on representation with symbolic register names (pseudo registers). Register allocation phase may indeed emit addition instructions, e.g. spill/load code when a register of a required class is not available for allocation.

    Check this Survey on Instruction Selection: an Extensive and Modern Literature Review