In a multiple-issue CPU example from the textbook, why does the instruction after the branch have to wait one cycle before issuing?

This is about an example in Section 3.8, "Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation", of Computer Architecture: A Quantitative Approach.

Given a dynamically scheduled, two-issue processor and the assembly code listed below (it essentially increments each element of an array):

Loop: LD      R2,0(R1)
      DADDIU  R2,R2,#1
      SD      R2,0(R1)
      DADDIU  R1,R1,#8
      BNE     R2,R3,LOOP

the book then shows the issue, execute, and write-result cycles in Figure 3.19.

My question is: why is the LD R2,0(R1) of iteration 2 issued in the fourth cycle instead of in the same cycle as the BNE? I can understand why the LD has to execute later, but I have no idea why its issue has to be postponed as well.

Follow-up question: how is this implemented (detecting a BNE and then postponing the next instruction), given that two instructions can be issued in the same cycle? Maybe the processor sees the incoming BNE in the first half of the cycle and then chooses not to issue the next instruction in the second half? That is just my guess; I found no related information in the book.

Solution

The caption on the figure already attempts to explain it: the fetch/decode branch handling has no way to fetch the branch target in the same cycle in which it fetches the branch itself.
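To see what that restriction does to the issue timing, here is a toy model in Python. It is only a sketch under assumptions that are mine, not the book's: in-order issue, at most two instructions per cycle, and a branch always ends its issue group so the branch target can issue in the next cycle at the earliest. It ignores execution latency and everything else about the pipeline. The names `ISSUE_WIDTH`, `LOOP_BODY` and `issue_cycles` are made up for the illustration.

```python
# Toy issue-cycle model (assumed rules, not taken from the book):
#   * in-order issue, at most ISSUE_WIDTH instructions per cycle
#   * a branch always closes its issue group, so the instruction at the
#     branch target cannot issue before the following cycle
ISSUE_WIDTH = 2
LOOP_BODY = ["LD", "DADDIU", "SD", "DADDIU", "BNE"]


def issue_cycles(iterations):
    """Return a list of (iteration, opcode, issue_cycle) tuples."""
    schedule = []
    cycle = 1   # current issue cycle
    used = 0    # issue slots already filled in the current cycle
    for iteration in range(1, iterations + 1):
        for op in LOOP_BODY:
            if used == ISSUE_WIDTH:
                # Both slots are full: move on to the next cycle.
                cycle += 1
                used = 0
            schedule.append((iteration, op, cycle))
            used += 1
            if op == "BNE":
                # Nothing may issue alongside (after) the branch.
                cycle += 1
                used = 0
    return schedule


if __name__ == "__main__":
    for iteration, op, cycle in issue_cycles(2):
        print(f"iteration {iteration}: {op:<7} issues in cycle {cycle}")
```

Under these assumptions the BNE of iteration 1 issues alone in cycle 3 and the LD of iteration 2 issues in cycle 4, which is the behaviour asked about in the question.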

You could have a CPU with a wider fetch / decode stage, and buffering between stages to absorb bubbles from taken branches, but this CPU doesn't have that.

(Another problem is that even if you could issue both instructions, this CPU doesn't do speculative execution: it has no mechanism to discard the ld if the prediction that the bne is taken turns out to be wrong. So it can't send the ld to the execution units until an execution unit has verified that the bne is taken.)

Re: implementation: the decoders work in parallel, not in the first and second half of the cycle.

The 2nd decoder already has to check for hazards like a data dependency between the 2 instructions and turn the 2nd instruction into a NOP.

I'd guess that if the first instruction is a branch, it muxes the 2nd instruction slot to a NOP instead of whatever was decoded. There is no need for that to be synchronous and happen at a "half cycle" boundary.
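That guess can be written down as a purely combinational selection. The Python below is only a software sketch of the idea, not the book's design or any real decoder: `BRANCH_OPCODES`, `select_slots` and the `other_hazard` flag are hypothetical names, and the flag stands in for whatever hazard check the second decoder already performs.

```python
# Hypothetical sketch of the slot-selection logic guessed at above.
BRANCH_OPCODES = {"BNE", "BEQ"}  # illustrative subset, an assumption
NOP = "NOP"


def select_slots(first_op, second_op, other_hazard=False):
    """Decide what the two issue slots receive in one cycle.

    Both opcodes are decoded in parallel. The second slot is replaced by a
    NOP if the first instruction is a branch or if some other hazard was
    detected. Returns (slot0, slot1, second_squashed).
    """
    squash_second = (first_op in BRANCH_OPCODES) or other_hazard
    slot1 = NOP if squash_second else second_op  # the "mux"
    return first_op, slot1, squash_second


if __name__ == "__main__":
    print(select_slots("LD", "DADDIU"))  # both instructions go ahead
    print(select_slots("BNE", "LD"))     # second slot becomes a NOP
```

When the third return value is true, the squashed instruction would have to be presented again in the next cycle; that bookkeeping is left out of the sketch.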