I'm reading Computer Systems: A Programmer's Perspective, 3/E (CS:APP3e) by Randal E. Bryant and David R. O'Hallaron.
The authors say:
"In general, a program can achieve the throughput bound for an operation only when it can keep the pipelines filled for all of the functional units capable of performing that operation. For an operation with latency L and capacity C, this requires an unrolling factor k ≥ C · L. For example, floating-point multiplication has C = 2 and L = 5, necessitating an unrolling factor of k ≥ 10. Floating-point addition has C = 1 and L = 3, achieving maximum throughput with k ≥ 3."
Why does this work? A throughput of 1.00 only indicates that a new operation can begin every cycle (latency 3.00, issue 1.00).
I tried to sketch a diagram of a fully pipelined unit. For example, a floating-point adder contains three stages (hence the three-cycle latency), so it can start a new operation every clock cycle.
One addition takes 3 cycles, but a new operation begins every cycle: after 3 cycles only the first addition is finished, after 4 cycles the second, after 5 cycles the third. So we get 3 operations in 5 cycles fully pipelined, instead of 3 operations in 9 cycles not pipelined. Is my understanding of a fully pipelined unit wrong?
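As a quick sanity check of those numbers: a fully pipelined L-stage unit finishes n back-to-back independent operations in n + (L − 1) cycles, versus n · L cycles without pipelining:

    pipelined:      3 + (3 − 1) = 5 cycles for 3 additions
    not pipelined:  3 · 3 = 9 cycles for 3 additions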
What is the reason for k = 10? Even if there are 2 units, each iteration must wait until the very last 2 operations are computed (i.e., 10 multiplications per iteration, but the iteration must still wait for the last 2 to finish), so I see no reason for unrolling.
Maybe the program does not execute in sequential order (I mean that the processor doesn't wait for the end of each iteration, due to branch prediction)?
/* 2 x 2 loop unrolling */
void combine6(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    long limit = length - 1;
    data_t *data = get_vec_start(v);
    data_t acc0 = IDENT;
    data_t acc1 = IDENT;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        acc0 = acc0 OP data[i];
        acc1 = acc1 OP data[i+1];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 OP data[i];
    }
    *dest = acc0 OP acc1;
}
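To see what an unrolling factor of k = 10 would look like in the same style, here is a sketch generalizing combine6 to k accumulators. This is not the book's code: it assumes the same vec_ptr/data_t/OP/IDENT framework, and it uses an array of accumulators only for brevity; the book's approach would use k separate scalar variables, since a compiler may not keep an array in registers.

/* Hypothetical k x k unrolling sketch (not from the book).
 * Assumes the book's framework: vec_ptr, data_t, OP, IDENT,
 * vec_length(), get_vec_start(). */
#define K 10   /* k >= C * L = 2 * 5 for floating-point multiply */

void combine_kxk(vec_ptr v, data_t *dest)
{
    long i, j;
    long length = vec_length(v);
    long limit = length - (K - 1);
    data_t *data = get_vec_start(v);
    data_t acc[K];      /* K independent dependency chains */

    for (j = 0; j < K; j++)
        acc[j] = IDENT;

    /* Combine K elements at a time; each acc[j] depends only on
     * its own previous value, so up to K operations can be in
     * flight at once across the pipelined functional units. */
    for (i = 0; i < limit; i += K)
        for (j = 0; j < K; j++)
            acc[j] = acc[j] OP data[i + j];

    /* Finish any remaining elements */
    for (; i < length; i++)
        acc[0] = acc[0] OP data[i];

    /* Fold the K partial results into one */
    data_t result = IDENT;
    for (j = 0; j < K; j++)
        result = result OP acc[j];
    *dest = result;
}

With K = 10 there are ten independent chains, enough to keep both multiplier pipelines (C = 2) full through all five stages (L = 5) of each.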
As the vector length grows, all those claims come closer and closer to the truth. Theoretically, the bounds are reached only for an infinitely long vector, when the pipeline startup and shutdown times divided by the overall processing time go to zero. The authors, for the sake of simplicity, assume these times to be negligible compared to the overall processing time in their tests.
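To put a formula behind that: the book models measured running time as T(n) = CPE · n + Overhead, so the cycles per element you actually observe are

    T(n) / n = CPE + Overhead / n

which approaches the CPE bound only as n grows; for short vectors the Overhead / n term (pipeline fill and drain, call overhead, and so on) dominates.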