
MIPS pipeline stages - what happens when an instruction doesn't need a stage, like MEM for ALU instructions?


I understand that there are five stages (IF, ID, EX, MEM, WB) and that the clock cycle is determined by the longest stage. What I don't understand is what happens when an instruction doesn't use all of the stages. Take an add instruction, which doesn't need the MEM stage, and say the clock cycle is 200 ps, so an instruction that uses all the stages takes 1000 ps to complete. Will an instruction that doesn't use the MEM stage also take 1000 ps (which means that 200 ps are wasted)? Thanks!


Solution

  • If an instruction doesn't need the MEM stage, it won't drive any memory-related signals in that stage, but it still needs to go through it.
    It is a waste of time, but still an improvement over non-pipelined processing. And since the IF/ID stages are only 1 instruction wide in the pipeline we're considering, the idle MEM cycle only costs latency (the result reaches WB a cycle later than it could), not throughput. Bypass forwarding takes care of most of that latency cost, since data can be forwarded to later instructions before it gets to WB.
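
    To put numbers on the latency/throughput distinction, here is a minimal sketch using the figures from the question (5 stages, 200 ps clock). The function names are made up for illustration, and the non-pipelined baseline simply assumes each instruction takes all five stage times back to back, which is a simplification.

```python
STAGES = 5       # IF, ID, EX, MEM, WB
CYCLE_PS = 200   # clock period assumed in the question

def pipelined_finish_time_ps(n):
    """Time until the n-th instruction leaves WB, assuming no stalls."""
    # The first instruction needs STAGES cycles; each later one finishes
    # one cycle after its predecessor.
    return (STAGES + n - 1) * CYCLE_PS

def unpipelined_finish_time_ps(n):
    """Baseline (assumption): instructions run one at a time, 5 stage times each."""
    return n * STAGES * CYCLE_PS

print(pipelined_finish_time_ps(1))       # 1000: latency of one add, MEM included
print(pipelined_finish_time_ps(1000))    # 200800
print(unpipelined_finish_time_ps(1000))  # 1000000
```

    Each add still has a 1000 ps latency, but in steady state one instruction completes every 200 ps, so the idle MEM slot does not reduce throughput.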


    One idea for making the classic MIPS 5-stage machine skip MEM on demand is to add a datapath from EX to WB, plus some control logic.
    If an R-type instruction comes after a load (see footnote 1), a conflict would arise:

    IF ID EX MEM WB
       IF ID EX  WB <-- Conflict: two instructions in WB
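
    The conflict in the diagram above can be reproduced with a small sketch. It assumes one instruction is fetched per cycle, cycles are numbered from 0, and MEM is skipped blindly with no arbitration; `naive_wb_cycles` and `wb_conflicts` are hypothetical helper names.

```python
def naive_wb_cycles(uses_mem):
    """WB cycle of each instruction if MEM is skipped with no arbitration.

    uses_mem[i] is True when instruction i needs the MEM stage.
    Instruction i is in IF, ID, EX during cycles i, i+1, i+2.
    """
    return [i + 3 + (1 if needs_mem else 0) for i, needs_mem in enumerate(uses_mem)]

def wb_conflicts(uses_mem):
    """Return (earlier_index, later_index, cycle) for every clash on WB."""
    first_user = {}   # cycle -> index of the first instruction using WB then
    conflicts = []
    for idx, cycle in enumerate(naive_wb_cycles(uses_mem)):
        if cycle in first_user:
            conflicts.append((first_user[cycle], idx, cycle))
        else:
            first_user[cycle] = idx
    return conflicts

# A load followed by an R-type instruction: both want WB in cycle 4.
print(wb_conflicts([True, False]))   # [(0, 1, 4)]
print(wb_conflicts([False, False]))  # []
```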
    

    The CPU could send the output of EX to both MEM and WB. The MEM stage would then mask off the EX-WB datapath when MEM is in use, and mask off the MEM-WB datapath when it is not.
    This way, when there is already an instruction in MEM, the instruction in EX goes to MEM (and not WB) in the next cycle:

    IF ID EX MEM  <-- Here EX-WB is masked (since MEM is used) and MEM-WB is allowed
       IF ID EX   <-- Can go to both MEM and WB but EX-WB is masked off
    
    
    IF ID EX MEM WB  <-- So this instruction's next stage is WB (as usual)
       IF ID EX  MEM <-- This goes to MEM instead, so the pipe keeps flowing
    

    If the previous instruction didn't need MEM, one stage can be skipped:

    IF ID EX  <-- Here EX-WB is allowed (assume no prev instructions) and MEM-WB is not (since MEM was not used)
       IF ID 
    
    IF ID EX WB <-- Instruction skips MEM stage since EX-WB was allowed
       IF ID EX <-- Next instruction, again, EX-WB is allowed since MEM was not used
    
    IF ID EX WB <-- Done
       IF ID EX WB <-- Stage MEM skipped
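
    The masking rule walked through above can be sketched as a routing decision made when an instruction leaves EX. This is only a behavioural model under the same assumptions as the diagrams (one instruction per cycle, no stalls, cycles numbered from 0); `route_from_ex` and `wb_cycles_with_masking` are illustrative names, not real hardware signals.

```python
def route_from_ex(needs_mem, mem_occupied):
    """Stage an instruction enters after EX.

    mem_occupied: the previous instruction is in MEM right now, so it will
    use the MEM-WB path next cycle and the EX-WB path is masked off.
    """
    if needs_mem or mem_occupied:
        return "MEM"
    return "WB"

def wb_cycles_with_masking(uses_mem):
    """WB cycle of each instruction; instruction i is in EX during cycle i + 2."""
    wb = []
    prev_went_to_mem = False
    for i, needs_mem in enumerate(uses_mem):
        nxt = route_from_ex(needs_mem, mem_occupied=prev_went_to_mem)
        wb.append(i + 4 if nxt == "MEM" else i + 3)
        prev_went_to_mem = (nxt == "MEM")
    return wb

print(wb_cycles_with_masking([False, False, False]))  # [3, 4, 5]  MEM skipped
print(wb_cycles_with_masking([True, False, False]))   # [4, 5, 6]  all pass through MEM
```

    In this model no two instructions ever share a WB cycle, but once one instruction occupies MEM, every instruction behind it is routed through MEM as well.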
    

    Footnote 1: Stores don't write any registers (apart from the program counter update), so they could maybe skip WB if nothing needs updating for interrupt handling and no resources need deallocating. In a CPU with a store buffer, its entries would normally deallocate after commit to L1d cache, which isn't tied to one of the pipeline stages. Loads, however, do write registers, so they can't skip WB and would create write-back conflicts.


    ADDENDUM

    It's worth noting that in case of a conflict for the WB stage, like in the very first example, it's better to stall the whole pipeline for 1 cycle; otherwise the conflict will never resolve and all the subsequent instructions will go through the MEM stage regardless of their type.

    With no stall:

    mem = Useless MEM stage but necessary to avoid a WB conflict
    MEM = Instruction uses the MEM stage
    
    IF ID EX MEM WB
       IF ID EX  mem WB
          IF ID  EX  mem WB
    

    If we introduce a stall of one cycle we resolve the conflict:

    Lowercase names mean stalled cycles
    
    IF ID EX MEM WB
       IF ID EX  ex  WB
          IF ID  id  EX WB
             IF  if  ID EX WB
                     IF ID EX WB
                        IF ID EX WB
    

    Notice how, from the throughput point of view, this optimisation doesn't really bring anything useful.
    The pipeline stabilises to a shorter length, but if you compare this diagram with one where MEM is mandatory, you get all the WB stages in the same cycles!
    If A depends on B, then A needs to wait for B to get to its WB stage (or EX if there is forwarding in the pipeline). Since the positions of the WB (or EX) stages are the same with or without this optimisation, it is not directly observable to the software (i.e. it has no benefit).
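
    The comparison can be checked with a short sketch for the instruction sequence in the stall diagram (one load followed by five instructions that don't need MEM). It assumes cycles are numbered from 0 and that a stall freezes the whole front of the pipeline for one cycle; both function names are hypothetical, and the model is only meant to reproduce that diagram, not every instruction mix.

```python
def wb_cycles_mandatory_mem(uses_mem):
    """Classic 5-stage pipeline: every instruction reaches WB in cycle i + 4."""
    return [i + 4 for i, _ in enumerate(uses_mem)]

def wb_cycles_skip_with_stall(uses_mem):
    """MEM is skipped when unused; stall one cycle on a WB conflict."""
    wb = []
    stall = 0            # total stall cycles inserted so far
    mem_busy_cycle = -1  # cycle in which the previous instruction occupies MEM
    for i, needs_mem in enumerate(uses_mem):
        ex = i + 2 + stall              # cycle in which this instruction enters EX
        if needs_mem:
            mem_busy_cycle = ex + 1     # EX, MEM, WB
            wb.append(ex + 2)
        elif mem_busy_cycle == ex:
            stall += 1                  # previous instruction owns WB next cycle
            wb.append(ex + 2)           # EX, ex (stalled), WB
        else:
            wb.append(ex + 1)           # EX, WB
    return wb

program = [True, False, False, False, False, False]  # load, then 5 ALU instructions
print(wb_cycles_mandatory_mem(program))    # [4, 5, 6, 7, 8, 9]
print(wb_cycles_skip_with_stall(program))  # [4, 5, 6, 7, 8, 9]
```

    For this sequence both lists are identical, which matches the observation that the WB stages land in the same cycles either way.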

    A shorter pipeline, however, consumes less energy and is faster to refill after a flush. But to really exploit the ability to skip a stage, one needs a superscalar CPU (one that has more than one execution unit, so that EX and MEM can overlap).