assembly architecture mips cpu-architecture pipelining

Number of stall cycles when there is only EX/MEM pipeline registers or only MEM/WB pipeline register

I am working on a problem which is related to The processor. The problem is the problem 4.12 in the book whose title is "Computer Organization and Design (6th Edition)". The problem has the assumption as following: Here is an image Figure 4.45

And the question number 3 makes me stuck on it

The answer for this problem is MEM/WB has fewer number of stall cycles (which has CPI is 1.35) than EX/MEM (which has CPI is 1.45). I am quite confused with the explanation of this answer. It states that:

With forwarding only from the EX/MEM register, EX to 1st dependences can be satisfied without stalls but any other dependences (even when together with EX to 1st) incur a one-cycle stall. With forwarding only from the MEM/WB register, EX to 2nd dependences incur no stalls. MEM to 1st dependences still incur a one-cycle stall, and EX to 1st dependences now incur one stall cycle because we must wait for the instruction to complete the MEM stage to be able to forward to the next instruction

It is correct that with only EX/MEM pipeline register, EX to 1st is solved, however, I can not prove the all of the other cases needs ONLY ONE STALL CYCLE. For example, in the case of MEM to 1st, I think it need for 2 stall cycles because the result now is produced at MEM stage, which does not has any MEM/WB pipeline register to save the result and forward it to the next instruction. Therefore, with 2 more stall cycles, it will have the correct result. And one more thing that I want to tell you is in the case of "EX to 1st and MEM to 2nd", in the answer, it also requires one more cycle to solve hazards here. It makes me so confused because it conflicts with the case "MEM to 1st" which required only 1 cycle too.

Similarly in the next case of only MEM/WB pipe line register. Do you have any proofs for the explanation of this answer? I really look forward to your answers of this question

Solution

Its not about having the pipeline register (or not, without the pipeline register it won't work at all), but rather having the bypass/forward muxing that provides data to mitigate the RAW hazard vs. causing a stall instead (which would be silly since the logic to detect the condition must be implemented in both cases or else the processor wont work).

You are right in thinking that MEM to 1st requires 2 stall cycles if there is no bypass/forward, and with a bypass/forward, only one cycle of the 2 is mitigated, the other stall cycle is still there and cannot be removed via bypass/forward. Why? Because the data needed is simply not available anywhere in the processor for the 1st after until one cycle later than needed for the 1st after's EX.

In the other cases of bypass/forward, the data is available somewhere in the processor at the right time to proceed with 1st or 2nd after — just that the data is not in the right place, so the bypass/forward fixes that. For example, in the cycle after an add instruction finishes its EX, the value of the arithmetic addition operation is somewhere in the processor but not yet in the target register. An instruction 1st after that uses that target (for another addition, say) naturally goes into the EX stage one cycle after, and thus the proper data is available timing-wise but is simply not in the right place (which would be the prior's target register), so a bypass realizes the RAW hazard and chooses the prior EX output instead of the stale value from the 1st after's ID-stage register read.

Whereas with a MEM to 1st RAW, the data won't be available anywhere in the processor until 1 cycle after the 1st after's EX stage, so one stall plus a bypass is used, and is the best that can be done.

What they are asking about is the cost of a three-way mux that can choose between bypass/forward from EX, bypass/forward from MEM, vs. a non-forward/register read from ID, in the same clock cycle.

If you want to support any of the above three items, a three-way mux (or two levels of two-way muxes) are necessary. Remove any one of the items and you can reduce muxing.

You can search for load/use RAW hazards for more detail.

In essence, let's say that a load instruction that enters IF at time t, is thus in ID at t+1, in EX at t+2, in MEM at t+3 and WB at t+4, five cycles in all.

If that load (say lw $a0, 0($a1) is immediately followed by an instruction that uses the load's target register (say addi $t0, $a0, 1), then that instruction (the addi) enters the pipeline, IF, at time t+1, and it goes into ID at t+2, and EX at t+3. At the beginning of cycle t+3 the data required for the addi to execute — the increment using the ALU in EX — is no where in the processor, and it won't arrive in the processor until the end of t+3, which is the end of the load's MEM cycle. So, even with a bypass/forward, a stall cycle is unavoidable and required. In this scenario, the stall delays the addis EX stage until t+4, a time when the data is available somewhere in the processor. At t+4 the load's data could be re-taken from the architectural register file (though that would require triple porting the read side of the register file, and would also probably increase the cycle time to accommodate that), or, be obtained from the MEM/WB pipeline register).

Since the architectural register file is read to obtain source operand values in the ID stage, without a bypass/forward, two stalls are required to delay the addis ID (which ideally would have happened at t+1) until t+4 when it can see results of the load's WB stage in the architectural register file.