How many 32-bit integer ops can a Haswell core perform at once?

In the context of preparing some presentation, it occurred to me that I don't know what the theoretical limit is for the number of integer operations a Haswell core can perform at once.

I used to naively assume "Intel cores have HT, but that's probably parallelizing different kinds of work, so probably a core maxes out its parallelism with 256-bit AVX operations, so 8 integer ops which can be issued per clock cycle (and assuming nice pipelining, 8 completing as well)." - so 8 ops/cycle.

But then I noticed this article, which tells me Haswells (and Sandy Bridges) have at least 3 dispatch ports which can feed vector units. So is the true figure 24 integer ops/cycle?

PS - I realize that in practice you might need to actually read all of that data from memory and its bandwidth would be the limiting factor. Or it'll be QPI that's too slow.

Solution

The theoretical maximum is 25 32-bit integer ops per cycle:

Port 0: 1 scalar op or 1 vector shift-by-constant or bitwise boolean op
Port 1: 1 scalar op or 1 vector add/sub/min/max or cmp or bitwise boolean op
Port 5: 1 scalar op or 1 vector add/sub/min/max or cmp or bitwise boolean op
Port 6: 1 scalar op (or 2, if you count SWAR with a 64bit integer register).

Since vector ops can do 8 32 bit operations, there is a maximum of 25 integer operations per cycle - 8 each for ports 0, 1, and 5 and 1 for port 6. Or 26 when SIMD-within-a-register on p6 is viable. (See Paul Clayton's comment.)
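The port counting above can be written out as a small tally. This is only a sketch of the arithmetic, not a measurement: the `ports` dictionary and the `peak_ops_per_cycle` helper are names made up for illustration, and the per-port figures are taken directly from the list above.

```python
# Number of 32-bit lanes in one 256-bit AVX2 vector register.
LANES = 256 // 32

# Best-case 32-bit ops each port can start per cycle, per the port list above.
# Ports 0, 1 and 5 can each run one 256-bit vector op; port 6 is scalar only.
ports = {
    "p0": LANES,
    "p1": LANES,
    "p5": LANES,
    "p6": 1,
}

def peak_ops_per_cycle(p6_swar=False):
    """Sum the per-port peaks.

    With p6_swar=True, the scalar op on port 6 is counted as two 32-bit ops
    (SIMD-within-a-register on a 64-bit integer register).
    """
    total = sum(ports.values())
    return total + 1 if p6_swar else total

print(peak_ops_per_cycle())      # 25
print(peak_ops_per_cycle(True))  # 26
```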

If we're just talking about "normal" integer stuff (add/multiply/bitwise/shift), then we have to exclude 32-bit multiplies (other than by power-of-2 constants) if we want to achieve 25 ops per clock. Real integer code will often be able to keep p0 busy with multiplies, PSADBW, shifts, and booleans, and will almost always have a significant amount of shuffling (p5). We're artificially excluding things that aren't strictly eight 32-bit ops per clock throughput, like multiplies, variable-count shifts, and data movement between integer and vector registers (MOVD / MOVQ).

Vector multiplies run on p0, but VPMULLD (eight 32x32 -> 32b multiplies) only runs at one per 2 cycles, since it takes 2 dependent uops (10c latency). See http://agner.org/optimize/ for instruction uop/port/throughput/latency tables.

Sustaining this throughput in the frontend will require the loop buffer, so keep the loop smaller than 28 uops (or 56 without hyperthreading). This includes the compare-and-branch loop overhead, so the theoretical throughput is actually slightly below 25. A macro-fused compare-and-branch runs on p6, though, so in a 28-uop loop (7 cycles at 4 uops per cycle) it only displaces every 7th scalar op, making the sustainable throughput something like 24.85 ops per clock. (Or about 25.71 with SWAR, since the displaced p6 op would have counted as two.)
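To make the loop-overhead arithmetic concrete, here is a rough sketch for the non-SWAR case. It assumes a loop that fills the 28-uop budget exactly, a 4-uops-per-cycle issue rate (one uop each for p0, p1, p5 and p6), and exactly one macro-fused compare-and-branch per iteration. The constant and function names are invented for this example.

```python
from fractions import Fraction

ISSUE_WIDTH = 4          # assumed: uops issued per cycle (p0, p1, p5, p6)
LOOP_UOPS = 28           # loop-buffer budget per thread with hyperthreading
PEAK_OPS_PER_CYCLE = 25  # 8 + 8 + 8 vector lanes plus 1 scalar op on p6

def sustained_ops_per_cycle(loop_uops=LOOP_UOPS):
    """Ops per cycle when one p6 slot per iteration goes to the loop branch."""
    cycles = Fraction(loop_uops, ISSUE_WIDTH)     # cycles per loop iteration
    useful_ops = PEAK_OPS_PER_CYCLE * cycles - 1  # branch displaces one scalar op
    return useful_ops / cycles

print(float(sustained_ops_per_cycle()))    # about 24.857 for a 28-uop loop
print(float(sustained_ops_per_cycle(56)))  # about 24.929 for a 56-uop loop
```

A longer loop amortizes the single branch over more cycles, which is why the 56-uop case (hyperthreading off) lands slightly closer to 25.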

Another source describing Haswell's microarchitecture.