Tags: cpu, pipeline, cpu-architecture, latency, instructions

Understanding CPU pipeline stages vs. instruction throughput


I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?

Besides the obvious of "different instructions require a different amount of work to complete", hear me out...

Consider an i7 with an approximately 14-stage pipeline. A single run-through takes 14 clock cycles, so AFAIK that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.

An XOR completes in 1 cycle and has a latency of 1 cycle, suggesting it doesn't go through all 14 stages. BSR has a latency of 3 cycles but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a reciprocal throughput of 8 cycles (on Ivy Bridge).

Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.

I know about the multiple execution units. I don't understand how instruction latency and throughput relate to the number of pipeline stages.


Solution

  • I'm missing something fundamental re. CPU pipelines: at a basic level, why do instructions take differing numbers of clock cycles to complete and how come some instructions only take 1 cycle in a multi-stage CPU?

    Because what we're interested in is the rate at which instructions complete, not the start-to-end time of a single instruction.

    Besides the obvious of "different instructions require a different amount of work to complete", hear me out...

    Well, that's the key answer to why different instructions have different latencies.

    Consider an i7 with an approximately 14-stage pipeline. A single run-through takes 14 clock cycles, so AFAIK that should mean the entire pipeline has a latency of 14 clocks. Yet this isn't the case.

    That is correct, though that's not a particularly meaningful number. For example, why do we care how long it takes before the CPU is entirely done with an instruction? That has basically no effect.
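
    To see why the depth washes out, here's a toy model (a sketch of an idealized pipeline that accepts one new instruction every cycle, not how any real core works): with S stages, instruction k finishes at cycle S + k, so the 14-cycle depth is paid once as a fill cost, not once per instruction.

    ```c
    #include <stdio.h>

    /* Toy model: an S-stage pipeline that accepts one new instruction
       per cycle. Instruction k (0-based) finishes at cycle S + k. */
    int main(void) {
        const int  STAGES = 14;      /* depth, like the i7's pipeline */
        const long INSNS  = 1000000; /* instructions to push through  */

        long total = STAGES + INSNS - 1;  /* fill once, then 1/cycle  */
        printf("first result after : %d cycles\n", STAGES);
        printf("all %ld done after : %ld cycles\n", INSNS, total);
        printf("cycles/instruction : %.5f\n", (double)total / INSNS);
        /* -> ~1.00001: throughput is 1 per cycle regardless of depth. */
        return 0;
    }
    ```
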

    An XOR completes in 1 cycle and has a latency of 1 cycle, suggesting it doesn't go through all 14 stages. BSR has a latency of 3 cycles but a throughput of 1 per cycle. AAM has a latency of 20 cycles (more than the stage count) and a reciprocal throughput of 8 cycles (on Ivy Bridge).

    This is just a bunch of misunderstandings. An XOR introduces one cycle of latency into a dependency chain. That is, if I execute 12 instructions that each modify the previous instruction's value and then append an XOR as the 13th instruction in that chain, the chain takes one cycle more. That's what the latency means: how long a dependent instruction has to wait for the result, not how long the instruction spends traversing the pipeline. The sketch below makes the distinction concrete.
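
    A minimal sketch of measuring latency vs. throughput directly, assuming GNU C with inline asm on x86-64 (and __rdtsc from x86intrin.h, which counts reference cycles, so the numbers are approximate). It uses IMUL, which like the BSR numbers above has roughly 3-cycle latency and 1-per-cycle throughput on these cores: one dependent chain is latency-bound, while three independent chains overlap in the pipeline and approach the throughput limit.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>          /* __rdtsc() on GCC/Clang, x86-64 */

    #define N 100000000ULL

    int main(void) {
        uint64_t a = 1, b = 2, c = 3, d = 4, start;

        /* Dependent chain: each IMUL reads the previous result, so the
           loop is bound by IMUL *latency* (~3 cycles per multiply). */
        start = __rdtsc();
        for (uint64_t i = 0; i < N; i++)
            __asm__ volatile("imul %1, %0" : "+r"(a) : "r"(i | 1));
        printf("dependent:   %.2f cycles/imul\n",
               (double)(__rdtsc() - start) / N);

        /* Three independent chains: no result feeds the next multiply,
           so the pipeline overlaps them and the loop approaches IMUL
           *throughput* (~1 multiply per cycle on one multiply port). */
        start = __rdtsc();
        for (uint64_t i = 0; i < N; i++) {
            __asm__ volatile("imul %1, %0" : "+r"(b) : "r"(i | 1));
            __asm__ volatile("imul %1, %0" : "+r"(c) : "r"(i | 1));
            __asm__ volatile("imul %1, %0" : "+r"(d) : "r"(i | 1));
        }
        printf("independent: %.2f cycles/imul\n",
               (double)(__rdtsc() - start) / (3.0 * N));

        return (int)(a ^ b ^ c ^ d) & 1;   /* keep results live */
    }
    ```

    Both loops execute the same instruction, yet one runs at ~3 cycles per multiply and the other at ~1: latency and throughput are different questions, and neither is the pipeline depth.
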

    Some instructions cannot be issued every clock, yet take less than 14 clocks to complete.

    Right. So? Latency and throughput are separate properties: throughput is about how often an execution unit can start a new instance of the instruction, latency is about when its result becomes available to dependent instructions, and neither has any obligation to match the pipeline depth.

    I know about the multiple execution units. I don't understand how instruction latency and throughput relate to the number of pipeline stages.

    They don't. Why should there be any connection? Say there were 14 extra stages at the beginning of the pipeline. Why would that affect latency or throughput at all? It would just mean everything happens 14 clock cycles later, but still at the same rate. (Though it would likely impact the cost of a mispredicted branch and other things; a rough way to observe that cost is sketched below.)
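
    One hedged way to see the mispredict cost, assuming GCC/Clang on x86-64 and a low optimization level such as -O1 (at higher levels the compiler may turn the branch into a branchless CMOV and hide the effect; check the generated asm): time a predictable branch against a ~50% random one. The gap is roughly the pipeline flush-and-refill penalty, which is where depth does matter.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>          /* __rdtsc() on GCC/Clang, x86-64 */

    #define N 10000000L

    static double cycles_per_branch(const uint8_t *cond) {
        uint64_t sum = 0, start = __rdtsc();
        for (long i = 0; i < N; i++)
            if (cond[i])            /* the conditional branch under test */
                sum += (uint64_t)i;
        double cycles = (double)(__rdtsc() - start) / N;
        if (sum == 42) puts("");    /* keep `sum` from being optimized out */
        return cycles;
    }

    static uint8_t predictable[N], random_bits[N];

    int main(void) {
        for (long i = 0; i < N; i++) {
            predictable[i] = (uint8_t)(i & 1);       /* simple pattern   */
            random_bits[i] = (uint8_t)(rand() & 1);  /* ~50% mispredicts */
        }
        printf("predictable: %.2f cycles/branch\n",
               cycles_per_branch(predictable));
        printf("random:      %.2f cycles/branch\n",
               cycles_per_branch(random_bits));
        /* The difference, on the order of 15-20 cycles per mispredict on
           a deeply pipelined core, is the cost of refilling the pipe. */
        return 0;
    }
    ```
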