I suspect recent Intel architectures (Broadwell in my case) perform the MUL mnemonic like an FMA with a zero addend.
In detail, I am computing the product of quadratic polynomials (Pi), following the pattern
P1*P2*P3*P4
Every polynomial Pi(x) = a + b*X + c*X^2 is evaluated by two successive FMAs. However, when I measure the throughput of my problem, the numbers are very low. According to Agner Fog's instruction tables (page 242), the throughput of both FMA and MUL is 0.5, where throughput is the number of cycles before a new identical instruction can start.
So I would expect a penalty when mixing FMA and MUL, yet my measurements are smooth. I suspect that under the hood the processor swaps the MUL for an FMA with a zero addend, or at least uses the same part of the FPU circuit, which would explain my results. I may be completely wrong, but I'd appreciate confirmation or refutation from a hardware engineer.
So I would expect a penalty when mixing FMA and MUL
Yes, from Agner Fog's tables you should look at which execution ports an instruction runs on. That's usually all you need to work out throughput for a sequence of different instructions. (On modern mainstream x86 CPUs like Broadwell, all execution units other than div/sqrt are fully pipelined (can start a new uop every clock cycle), so only some weird microcoded instructions like loop
have less throughput than you'd expect from looking at their uops / ports.)
The actual "throughput" numbers in Agner's tables are mainly useful as a summary or indication of any weirdness, and not usually directly useful especially for efficient single-uop instructions like vmulps
or vfma...ps
. See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? for some more details on how to predict performance for a block of multiple instructions in terms of latency, back-end port bottlenecks, and front-end uop throughput bottlenecks.
yet my measurements are smooth. I suspect that under the hood the processor swaps the MUL for an FMA with a zero addend, or at least uses the same part of the FPU circuit, which would explain my results.
Huh, I don't understand. You just said you thought MUL and FMA should conflict with each other, but now you're saying that running MUL on the FMA unit would explain something??
I suspect recent Intel architectures (Broadwell in my case) perform the MUL mnemonic like an FMA with a zero addend.
Almost every FP operation that has to normalize an FP result (except FP add) runs on the FMA unit in Broadwell. But mul and add have 3 cycle latency on Broadwell, while actual FMA has 5 cycle latency, so clearly there are different configurations for the FMA unit. MUL/FMA are identical for throughput, but not for latency on Broadwell.
(Unlike Skylake where the separate add unit was dropped, and mul/add both have the exact same 4c latency / 0.5c throughput as FMA).
Having MUL with different latency than FMA in Broadwell is unusual; most CPUs that have both run them with identical performance, presumably just feeding a 0.0
into the add input, or something equivalent.
SIMD Integer multiply also uses the multipliers in the FMA unit, and so does integer shift. A surprising amount of stuff uses it, but it makes sense especially in Skylake-X that they'd take advantage of those transistors for as much as possible, instead of having more 512-bit-wide SIMD execution units.
I am computing the product of quadratic polynomials (Pi), following the pattern
P1*P2*P3*P4
What are you doing with the results? Are you only doing groups of 4? What do you do with the result of each group?
Or are you multiplying many quadratic polynomials in one long chain of multiplies, creating a serial dependency chain of mulps?
That would bottleneck you at 3 cycles per polynomial, with the independent evaluation of each polynomial (2x FMA) happening in parallel to feed that mulps. In that case Broadwell is your ideal CPU, with 3-cycle mulps vs. 5 cycles on Haswell and 4 cycles on Skylake.
But if you can pretend FP math is associative and have different temporary results, you can run 2, 3, or 4 chains of multiplies (or even more) and combine at the end, using an unrolled loop with multiple vectors. e.g. (P1*P3*P5*... ) * (P2*P4*P6*...)
, with that final multiply outside the loop as part of the cleanup.
See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for more about unrolling with multiple accumulators to hide FP latency.