Given a chain of instructions linked by true dependencies and repeated periodically (i.e. a loop), for example (a->b->c)->(a->b->c)->...
Assuming that it can be split into several shorter and independent sub-dependency chains to benefit from out-of-order execution :
The out-of-order engine schedules each instruction to the corresponding CPU unit which have a latency and a reciprocal throughput.
What is the optimal number of sub-dependency chains maximizing the execution throughput ?
According to Agner's manual Optimizing subroutines in assembly language, Section 12.15 : "The optimal number of accumulators if the CPU has nothing else to do is the latency of the most critical instruction in the dependency chain divided by the reciprocal throughput for that instruction". What does "most critical instruction" mean ? Is there any other technical documentation tackling this kind of problem ?
It depends how long they are, and how many uops per cycle each one can run on its own.
It also depends on how wide the hardware is. e.g.
I think "most critical instruction" means the one that makes up most of the length of the critical path. If a loop-carried dependency chain is composed of multiple instructions with different latencies, it's some sort of average. (Like maybe a geometric average?)
A good example is FP add (e.g. summing an array):
On Sandybridge, it has one-per-clock throughput but 3c latency, so a single chain of dependent addps
instructions runs at one uop per 3c, sustaining only 1/3rd of the max FP multiply throughput. (And leaving the other two execution ports totally unoccupied.)
Three parallel dep chains can keep port1 saturated with addps
instructions. So if you use three accumulators, you can keep three adds in flight. If you also keep 5 FP multiplies in flight, you can saturate port0 as well. Loop overhead can run on port5 (and hopefully not steal cycles from p01). The load uops can micro-fuse with the adds, so they don't take up fused-domain bandwidth. But you could do some of the loads with separate movaps
instructions and still not saturate the 4 per clock fused-domain uop throughput, but bottlenecks in the frontend might limit you to less throughput.
Haswell still only has one per clock throughput for FP add, but two per clock for FP mul and FMA.
So if you sum an array using FMA (with a multiplier of 1.0), you need 10 vector accumulators (10 dep chains) to keep 10 FMAs in flight, saturating p01. p5 and p6 go unused, but you can also saturate the load ports with micro-fused loads.
Skylake reduced the latency of FMA by 1 cycle, down to 4, and dropped the FP add unit. (So FP add is done in the FMA unit, which doubled the throughput for [v]addps
, at the cost of 1c more latency).
So on SKL you only need 8 vector accumulators (8 dep chains) to saturate p01. But having more dep chains doesn't hurt, as long as you don't run out of registers. So code that's ideal on Haswell, using 10 accumulators, should still be ideal on SKL. You could maybe save some power by just using addps
instead of fma213ps
(or whatever) with a constant vector of 1.0, though.
See Agner's instruction tables for throughput/latency/port numbers, and his microarch PDF for more details. I didn't check the port numbers or latency numbers, but I've typed this example so often that I'm pretty sure it's correct :P.
Also see other links in the x86 tag wiki.