
How does the instruction after a load execute?


There are 5 stages in the pipeline.

IF  - Instruction fetch
ID  - Instruction decode, read the registers
EX  - For a memory reference, add the base and offset;
      for an arithmetic instruction, do the math.
MEM - For a load or store, access memory
WB  - Place the result in the appropriate register.
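As a rough illustration (a minimal sketch, not part of the original question: it assumes an ideal pipeline, one instruction issued per cycle, and no hazards), the cycle in which each instruction occupies each stage can be tabulated:

```python
# Minimal sketch of an IDEAL 5-stage pipeline: one instruction enters
# IF per cycle and nothing ever stalls (assumption for illustration).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def chart(n_instructions):
    """Return {instruction index: {stage: cycle}}, cycles 1-based."""
    timing = {}
    for i in range(n_instructions):
        # Instruction i enters IF in cycle i + 1, then moves one stage per cycle.
        timing[i] = {stage: i + 1 + s for s, stage in enumerate(STAGES)}
    return timing

t = chart(3)
print(t[0])  # {'IF': 1, 'ID': 2, 'EX': 3, 'MEM': 4, 'WB': 5}
print(t[1])  # {'IF': 2, 'ID': 3, 'EX': 4, 'MEM': 5, 'WB': 6}
```

This ideal chart is the baseline; the question is about where I2 deviates from it because of the load in I1.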



 I1 : R0 <- M[loc]      IF | ID | EX | MEM | WB |

 I2 : R0 <- R0 + R0                  | IF  | ID | EX | MEM | WB  |

 I3 : R2 <- R2 - R0                        | IF | ID | EX  | MEM | WB |

Assume that operand forwarding is used.
The solution says:

Instruction I1 is a load instruction, so the next instruction (I2) cannot be fetched until I1 finishes its EX stage.

But I think: in the MEM stage the processor accesses memory and picks the desired word, and in the WB stage it updates the register file.
So the processor holds control of memory until the MEM stage, and I2 will start fetching after I1's MEM.

Which one is correct?

A description of the stages was not given; I wrote it as per my knowledge.


Solution

  • Convention:

    I denotes a generic instruction.
    I1, I2, I3, ... denote specific instructions. S denotes a generic stage of the pipeline.
    IF, ID, EX, MEM, WB denote specific stages of the pipeline.
    I.S denotes the cycle in which instruction I was in stage S.

    The instruction I2 needs R0 but that register won't be ready from I1 until I1.WB has finished, assuming a basic pipeline.
    With operand forwarding present, I2 can read the result while I1 is writing it back into the register file, that is, during I1.WB.

    Since the operand is read in the ID stage, I2.ID and I1.WB must happen at the same time.
    That means that I2.IF must happen at the same time as I1.MEM.
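    That cycle arithmetic can be checked with a small sketch (assumptions for illustration: cycles are 1-based, I1 is fetched in cycle 1, and the loaded value is forwarded during I1.WB):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycles(start):
    """Map each stage to its clock cycle for an instruction fetched in cycle `start`."""
    return {stage: start + s for s, stage in enumerate(STAGES)}

i1 = cycles(1)             # I1: IF=1, ID=2, EX=3, MEM=4, WB=5
# With forwarding during WB, I2 must read R0 in ID while I1 is in WB:
i2_id = i1["WB"]           # I2.ID = 5
i2_if = i2_id - 1          # IF immediately precedes ID, so I2.IF = 4
assert i2_if == i1["MEM"]  # I2.IF coincides with I1.MEM
print(i2_if)               # 4
```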


    Now you are rightfully dubious about the ability of the CPU to perform two memory reads (one for the instruction fetch and one for the load) in the same clock cycle.
    Very simple CPUs actually stall on such a conflict (in your example, I2.IF would happen at I1.WB).
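    A minimal sketch of that structural stall (my own illustration, not from the original solution; `single_memory_port` is an assumed name modeling whether fetch and data access share one memory):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycles(start):
    """Map each stage to its clock cycle for an instruction fetched in cycle `start`."""
    return {stage: start + s for s, stage in enumerate(STAGES)}

def fetch_cycle(i1, desired_if, single_memory_port=True):
    """Delay I2's fetch if it would collide with I1's data memory access."""
    c = desired_if
    if single_memory_port and c == i1["MEM"]:
        c += 1  # slip one cycle: I2.IF now overlaps I1.WB instead
    return c

i1 = cycles(1)
print(fetch_cycle(i1, 4))         # 5 -> unified memory: I2.IF happens at I1.WB
print(fetch_cycle(i1, 4, False))  # 4 -> separate paths: I2.IF happens at I1.MEM
```

    With `single_memory_port=False` (a Harvard-style split, as discussed below in the original text), the collision disappears and I2.IF can share the cycle with I1.MEM.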

    The simplest approach to avoiding a stall is the Harvard architecture, where the CPU fetches instructions from a separate memory.

    The Harvard architecture has been modified through the use of caches and prefetching of data and instructions (the modified Harvard architecture).
    In this context a stall occurs only if both the load and the instruction fetch need to access main memory (and not just the caches).

    Modern desktop architectures have L1 data caches that can handle more than one access at a time, and the CPU is tightly coupled with them, so that two or more loads/stores can be executed at the same time, in parallel with fetches from the L1 instruction cache.

    Finally, some modern CPUs decode more than one instruction at a time, alleviating (but not eliminating) the problem of the stall.
    It is the caches that provide the greatest benefit in avoiding stalls, though.