I would like to understand how to compute FMA performance. If we look into the description here:
for Skylake architecture the instruction have Latency=4
and Throughput(CPI)=0.5
, so the overall performance of the instruction is 4*0.5 = 2
clocks per instruction.
So as far as I understand if the max (turbo) clock frequency is 3GHz, then for a single core in one second I can execute 1 500 000 000 instructions.
Is it right? If so, what could be the reason that I am observing a slightly higher performance?
Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.
Just working out the units gives cycles²/instr
, which is strange and I have no interpretation for it.
The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each others reciprocal, the latency has nothing to do with it.
There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.