Understanding FMA performance

I would like to understand how to compute FMA performance. If we look into the description here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_fmadd_ps&expand=2520,2520&techs=FMA

for Skylake architecture the instruction have Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.

So as far as I understand if the max (turbo) clock frequency is 3GHz, then for a single core in one second I can execute 1 500 000 000 instructions.

Is it right? If so, what could be the reason that I am observing a slightly higher performance?

Solution

Latency=4 and Throughput(CPI)=0.5, so the overall performance of the instruction is 4*0.5 = 2 clocks per instruction.

Just working out the units gives cycles²/instr, which is strange and I have no interpretation for it.

The throughput listed here is really a reciprocal throughput, in CPI, so 0.5 cycles per instruction or 2 instructions per cycle. These numbers are related by being each others reciprocal, the latency has nothing to do with it.

There is a related calculation that does involve both latency and (reciprocal) throughput, namely the product of the latency and the throughput: 4 * 2 = 8 (in units of "number of instructions"). This is how many independent instances of the operation can be "in flight" (started but not completed) simultaneously, comparable with the bandwidth-delay product in network theory. This number informs some code design decisions, because it is a lower bound on the amount of instruction-level parallelism the code needs to expose to the CPU in order for it to fully use the computation resources.