I came across the following MIPS code, which has data and load-use hazards.
0 addi $t0,$a0,4
1 addi $t1,$a1,4
2 sub $t2,$t0,$t1 #data hazard $t0, data hazard $t1
3 sll $t3,$a2,2
4 add $t4,$t0,$t3 #data hazard $t3
5 add $t5,$t1,$t3 #data hazard $t3
6 sw $t2,0($t4) #data hazard $t4
To resolve the hazards I could add 5 NOPs (2 after line 1, 2 after line 3 and 1 after line 5), or I could rewrite the code, thus avoiding the hazards altogether. Rearranging the code in order to minimise the number of NOPs, I get:
0 addi $t0,$a0,4
1 addi $t1,$a1,4
3 sll $t3,$a2,2
2 sub $t2,$t0,$t1 #data hazard $t1
4 add $t4,$t0,$t3
5 add $t5,$t1,$t3
6 sw $t2,0($t4) #data hazard $t4
The two remaining hazards would then be resolved by adding 1 NOP after line 3 and 1 NOP after line 5. However, this code is relatively simple and short. What if I am given 20 lines of MIPS code in the exam? Is there a faster way or rule that can make rearranging the code easier and faster to do (by hand)?
For an algorithm for instruction scheduling, you first need to identify the dependency chains. (The same as you would when identifying latency-critical paths for out-of-order exec.)
To schedule instructions for an in-order machine: interleave instructions from different dep chains, starting with the longest one.
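To make that rule concrete on the code from the question, here is a worked sketch of the analysis (the instructions are unchanged; the chain labels in the comments are just annotations):

```mips
# Chain A: addi $t0 -> sub $t2 / add $t4 -> sw
# Chain B: addi $t1 -> sub $t2 / add $t5
# Chain C: sll  $t3 -> add $t4 / add $t5
addi $t0,$a0,4      # A
addi $t1,$a1,4      # B
sll  $t3,$a2,2      # C -- independent, so it fills a slot between the addis and the sub
sub  $t2,$t0,$t1    # A+B: now one instruction away from the addi that wrote $t1
add  $t4,$t0,$t3    # A+C
add  $t5,$t1,$t3    # B+C
sw   $t2,0($t4)     # A: still only one instruction after the add that wrote $t4
```

With the chains written out like this, the interleaving is mechanical: at each slot, pick a ready instruction from whichever chain has waited longest, preferring the longest remaining chain.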
When hand-tuning (or in an optimizing compiler) you can even do stuff like rearrange associative operations (like add) to create different temporaries, creating more ILP (instruction-level parallelism), e.g. turning a+b+c+d into (a+b)+(c+d). Integer math is associative; floating-point math generally isn't.
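As a sketch of that reassociation (assuming a..d live in $a0..$a3, and using addu for the reason given below):

```mips
# a+b+c+d evaluated left-to-right: a serial chain of 3 dependent addus
addu $t0, $a0, $a1     # t0 = a+b
addu $t0, $t0, $a2     # t0 = (a+b)+c    -- must wait for the previous addu
addu $t0, $t0, $a3     # t0 = ((a+b)+c)+d

# (a+b)+(c+d): the first two addus are independent, chain length only 2
addu $t0, $a0, $a1     # t0 = a+b
addu $t1, $a2, $a3     # t1 = c+d        -- no dependency on $t0
addu $t0, $t0, $t1     # t0 = (a+b)+(c+d)
```

The second form needs one extra temporary ($t1) but shortens the critical path, which is exactly what helps on a machine with multi-cycle latencies.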
Of course this is only safe if the code uses addu / addiu, not MIPS's trap-on-signed-overflow add / addi. C compilers never use trapping add/sub, so they can freely optimize arithmetic with different temporaries than the C source. You should, too, unless you specifically want an instruction to possibly trap on signed overflow.
Apparently classic MIPS assemblers could rearrange your code for you to fill load and branch delay slots; this Silicon Graphics assembler manual from 1992 goes into some detail about aggressive instruction reordering by the assembler (unless you use .set noreorder, and then branch-delay slots become visible). The book See MIPS Run may also mention this.
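For example (a hypothetical fragment; the registers and instructions are arbitrary, the point is only what .set noreorder exposes):

```mips
.set noreorder
        beq   $a0, $zero, skip
        addiu $v0, $v0, 1      # branch-delay slot: this executes whether or not
                               # the beq is taken
skip:
```

Without .set noreorder, the assembler would hide this slot from you, filling it itself (with a useful instruction or a nop).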
Presumably the SGI assembler detects basic blocks in terms of labels and branch instructions, and does instruction scheduling within single blocks.
Of course good compilers for high-level languages also do instruction scheduling. (For example GCC). I don't think the GNU assembler has an optimizing reordering mode; it's designed as a back-end for a compiler that schedules instructions itself.
Unlike your toy example with multi-cycle latency before you can use an add result, real classic MIPS was a classic 5-stage RISC with bypass forwarding and single-cycle ALU latency. The first generation had no interlocks, so there was a load-delay slot of 1 cycle, and branch-delay slots remained architecturally visible. But simple ALU instructions have 1-cycle latency, so real MIPS had far fewer hazards to avoid when scheduling instructions. Even so, later MIPS revisions removed the load-delay slot for better code density, when the relatively primitive compilers of the day couldn't find anything to put there. (Stalling instead of needing a NOP.)
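A sketch of the difference on first-generation MIPS (registers chosen arbitrarily):

```mips
lw   $t0, 0($a0)
nop                    # MIPS I load-delay slot: $t0 not usable yet
                       # (an independent instruction could go here instead)
addu $t1, $t0, $t0     # first cycle in which $t0 is usable

addu $t2, $t1, $t1     # ALU -> ALU: bypass forwarding makes $t1 usable
                       # immediately, no slot or stall needed
```

So on real hardware only loads (and branches) need scheduling attention; plain ALU dependency chains like the ones in the question cost nothing.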
A real machine with this many delay slots (and no hardware interlocking to stall) would be very impractical for code density / L1i cache footprint, as well as having bad performance. There's a reason real-world commercial designs do bypass forwarding instead of stalling. But your example of effectively multi-cycle latency is realistic for floating-point.
(Fun fact: MIPS can forward a branch target address from the first half of EX to a half-cycle IF, for a total of only 1 cycle of branch latency.)
MIPS was stuck with branch delay slots until a major (and not backwards-compatible) reorg of opcodes introduced no-delay-slot branches (MIPS32/64r6 in 2014). The instruction in the branch delay slot executes whether the branch is taken or not, so later MIPS hardware couldn't remove it the way it could for load delay slots. RISC-V avoided this mistake.