performance assembly mips pipeline cpu-architecture

Why are data forwarding and stall cycles more efficient than NOPs for dealing with load-use hazards?

We can use NOPs, data forwarding, and stall cycles to resolve data hazards and load-use hazards. However, if we have multiple data hazards, resolving all of them with NOPs becomes quite inefficient, as the NOPs increase the runtime of the program. By comparison, if we have a load-use hazard, we can resolve it with data forwarding plus a stall cycle, which is said to give a more efficient result. My question is: how is data forwarding combined with stall cycles more efficient at dealing with data hazards than NOPs? When we add a stall cycle, the program has to wait one clock cycle to allow for the data forwarding (MEM to EX), so the clock cycle count still increases by 1.

Solution

Data forwarding overcomes some hazards by recognizing that the value computed by a prior instruction is available earlier than the point at which it is written back to the register file. So wherever forwarding can resolve a hazard, it is a win over both stalling and NOPs.
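To put rough numbers on this, here is a minimal sketch in Python of a classic 5-stage in-order pipeline (IF, ID, EX, MEM, WB). The function name `stall_cycles`, the tuple encoding of instructions, and the gap values are assumptions made for illustration: without forwarding a consumer must be fetched at least 3 cycles after its producer (register file written in the first half of WB, read in the second half of ID), with forwarding an ALU result can be used by the very next instruction, and a loaded value needs one extra cycle.

```python
def stall_cycles(program, forwarding):
    """Count bubbles the hardware inserts in a toy 5-stage in-order pipeline."""
    ready = {}   # register -> earliest fetch cycle at which a consumer may start
    cycle = 0    # fetch cycle of the current instruction if nothing stalls
    stalls = 0
    for op, dest, srcs in program:
        # Wait until every source register is available to this instruction.
        earliest = max([cycle] + [ready.get(r, 0) for r in srcs])
        stalls += earliest - cycle
        cycle = earliest
        if dest is not None:
            if not forwarding:
                gap = 3          # wait for write-back to the register file
            elif op == "lw":
                gap = 2          # loaded value is forwarded from MEM: one bubble
            else:
                gap = 1          # ALU result is forwarded from EX: no bubble
            ready[dest] = cycle + gap
        cycle += 1
    return stalls


program = [
    ("lw",  "$t0", ["$a0"]),         # lw  $t0, 0($a0)
    ("add", "$t1", ["$t0", "$t0"]),  # add $t1, $t0, $t0   (load-use hazard)
    ("sub", "$t2", ["$t1", "$a1"]),  # sub $t2, $t1, $a1   (ALU-to-ALU hazard)
]

print(stall_cycles(program, forwarding=False))  # 4
print(stall_cycles(program, forwarding=True))   # 1
```

Under these assumptions, forwarding removes the ALU-to-ALU bubbles entirely and shrinks the load-use penalty from two cycles to one. That remaining cycle is the stall the question asks about.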

Of course, stalling is sometimes necessary, as in the load-use hazard you describe. For a single hazard, a stall costs the same cycle that a NOP would. However:

Code size is smaller without the NOPs. Code size has a large effect on instruction-cache behavior, which in turn affects performance, so it cannot be ignored.

Also, from the perspective of architecture longevity: while we may know the number of NOPs needed for one micro-architecture, that number will most likely change in future micro-architectures, so the NOPs inserted in an older program no longer do their job properly on the newer hardware. Thus, we conclude that it is better to let the hardware stall than to insert NOPs.

For example, an out-of-order machine may internally rearrange instructions to cover a MEM->EX hazard (NOPs would just get in the way).
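The code-size and longevity points can be illustrated with a small Python sketch. It models only a `lw` followed by a dependent `add`, with some number of NOPs baked into the binary between them. The function names `run` and `code_bytes` and the parameters are hypothetical, chosen for this example; the only hardware fact relied on is that each MIPS32 instruction, NOP included, occupies 4 bytes.

```python
INSTRUCTION_BYTES = 4  # every MIPS32 instruction is one 32-bit word


def code_bytes(nops_in_binary):
    """Size of `lw` + NOPs + dependent `add` in bytes."""
    return (2 + nops_in_binary) * INSTRUCTION_BYTES


def run(nops_in_binary, bubbles_needed, has_interlock):
    """Return (issue_slots, correct) for `lw` + NOPs + dependent `add`.

    bubbles_needed is how many idle slots this micro-architecture requires
    between the load and its consumer; has_interlock says whether the
    hardware detects the hazard and stalls on its own.
    """
    missing = max(0, bubbles_needed - nops_in_binary)
    if has_interlock:
        # Hardware adds only the bubbles the binary did not already provide.
        return 2 + nops_in_binary + missing, True
    # No interlock: too few NOPs means the add reads a stale register value.
    return 2 + nops_in_binary, missing == 0


# Binary without NOPs on a forwarding machine with interlocks.
print(run(0, bubbles_needed=1, has_interlock=True), code_bytes(0))  # (3, True) 8

# Old binary padded with 2 NOPs, run on that same newer machine.
print(run(2, bubbles_needed=1, has_interlock=True), code_bytes(2))  # (4, True) 16

# Binary padded with 1 NOP, run on a machine that needs 2 and has no interlock.
print(run(1, bubbles_needed=2, has_interlock=False))                # (3, False)
```

In this toy model the unpadded binary is the smallest and always pays exactly what the current hardware requires, while a padded binary either wastes slots on hardware that needs fewer bubbles or breaks on hardware that needs more.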