
AVX512: Best way to combine horizontal sum and broadcast


There is already a question about horizontal sums using AVX512. I'm trying to do something similar, but after the sum, I would like to broadcast the result to all 8 elements in a __m512d variable. So far, I have tried:

  1. Using the Intel-provided macros:
double sum = _mm512_reduce_add_pd( mvx );  // horizontal sum down to a scalar
sumx = _mm512_set1_pd( sum );              // broadcast it back to all 8 elements
  2. Using shuffle/permute, trying to avoid lane crossings as much as possible:
sumx = mvx;

// Walk mvx through 7 permutations so every element visits every
// position once, accumulating into sumx at each step.
mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);              // swap the two doubles in each 128-bit lane
sumx = _mm512_add_pd(mvx, sumx);

mvx = _mm512_permutex_pd(mvx, _MM_PERM_ABCD);               // reverse the 4 doubles in each 256-bit half
sumx = _mm512_add_pd(mvx, sumx);

mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);              // swap within 128-bit lanes again
sumx = _mm512_add_pd(mvx, sumx);

mvx = _mm512_shuffle_f64x2(mvx,mvx, _MM_SHUFFLE(1,0,3,2));  // swap the 256-bit halves
sumx = _mm512_add_pd(mvx, sumx);

mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);              // swap within 128-bit lanes
sumx = _mm512_add_pd(mvx, sumx);

mvx = _mm512_permutex_pd(mvx, _MM_PERM_ABCD);               // reverse within 256-bit halves
sumx = _mm512_add_pd(mvx, sumx);

mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);              // swap within 128-bit lanes
sumx = _mm512_add_pd(mvx, sumx);

  3. Using the hint by @PeterCordes, reducing the shuffle/add steps to 3:
sumx = mvx;

mvx = _mm512_shuffle_pd(mvx, mvx, 0b01010101);                // step 1: swap the two doubles in each 128-bit lane
sumx = _mm512_add_pd(mvx, sumx);                              // every pair now holds its 2-element sum

mvx = _mm512_permutex_pd(sumx, _MM_PERM_ABCD);                // step 2: reverse within 256-bit halves (pairs are equal, so this effectively swaps the 128-bit chunks)
sumx = _mm512_add_pd(mvx, sumx);                              // every half now holds its 4-element sum

mvx = _mm512_shuffle_f64x2(sumx,sumx, _MM_SHUFFLE(1,0,3,2));  // step 3: swap the 256-bit halves
sumx = _mm512_add_pd(mvx, sumx);                              // full sum broadcast to all 8 elements

In each case, mvx is the __m512d input and sumx is the __m512d output.

I'm benchmarking it on an Intel Skylake CPU using the Intel compiler:

  • Version 1: 2.17s
  • Version 2: 2.31s
  • Version 3: 1.96s

Is this the best I can do or do you see another way to optimize this operation?


Solution

  • Generally the best way is to swap halves instead of narrowing, so the same sum gets computed in both halves. (Especially if you don't care about Zen 4 or hypothetical future CPUs where there's a throughput advantage to narrowing to 256-bit.) It should only take 3 shuffle/add steps to handle 2^3 = 8 doubles in a __m512d, one of them being in-lane adding pairs.

    Your third version is correctly doing that, and looks optimal on current CPUs (Intel and Zen 4).

    Doing the lower-latency in-lane shuffle first like you're doing is good for out-of-order exec, letting more uops execute and retire a couple cycles sooner to make room in the scheduler and ROB for new maybe-independent work sooner.
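
    For reference, a minimal self-contained sketch of that 3-step pattern packaged as a function (the name hsum_broadcast_pd is just illustrative, not from the original code):

    #include <immintrin.h>

    // Horizontal sum of 8 doubles with the result broadcast to all elements.
    // Same step order as the question's version 3: the cheap in-lane shuffle
    // first, then the two lane-crossing shuffles.
    static inline __m512d hsum_broadcast_pd(__m512d v) {
        // step 1: vshufpd, swap the two doubles in each 128-bit lane (1c latency)
        __m512d t = _mm512_shuffle_pd(v, v, 0b01010101);
        __m512d s = _mm512_add_pd(v, t);
        // step 2: vpermpd-immediate, swap the 128-bit chunks within each 256-bit half
        t = _mm512_permutex_pd(s, _MM_SHUFFLE(1,0,3,2));
        s = _mm512_add_pd(s, t);
        // step 3: vshuff64x2, swap the 256-bit halves
        t = _mm512_shuffle_f64x2(s, s, _MM_SHUFFLE(1,0,3,2));
        return _mm512_add_pd(s, t);
    }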

    On current Intel CPUs, all 512-bit lane-crossing shuffles of 32-bit granularity or wider have the same performance: 1 uop for port 5 with 3c latency. And in-lane 512-bit shuffles are 1 uop for port 5 with 1c latency.

    On Zen 4, vshufpd and vpermilpd both have the same timings as each other, and so do vpermpd / vshuff64x2 / vshuff32x4 / valignq. (https://uops.info/) All these shuffles have immediate operands for their controls, so the compiler doesn't have to load a vector constant.


    Any tweaks would just be based on guess-work about what might be faster on possible future CPUs, like a future Intel E-core with AVX-512 support, or a stripped-down AMD Zen 4 used as an E-core in their future CPUs, if they change the execution units at all instead of just cache. Or future big cores which might have room for multiple 512-bit shuffle units. This code has a serial dependency through all 6 operations, but being able to run on more ports might let out-of-order execution do a better job of overlapping this with independent surrounding code, or with work from another logical core.

    Using the widest granularity shuffle available has historically been best. e.g. vextractf64x4 ymm, zmm, imm is faster on Zen 4 than vextractf64x2 xmm, zmm, imm, so prefer the former for extracting the third 128-bit chunk if you don't mind bringing high garbage along with it. Fewer, larger chunks means fewer possible arrangements and shorter chains of multiplexing, so it might have lower latency or run on more execution units. But there is no vshuff64x4, only vshuff64x2 with 128-bit chunks, so that's our only good option for swapping 256-bit halves.
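
    For example, a sketch of grabbing that third 128-bit chunk via the wider extract (function name illustrative; reuses the immintrin.h include from the earlier sketch):

    // Elements 4..5 of a __m512d: extract the high 256-bit half with
    // vextractf64x4, then take its low 128 bits. The cast costs no
    // instructions; the "high garbage" (elements 6..7) just rides along
    // in the ymm register and is never looked at.
    static inline __m128d third_chunk_pd(__m512d v) {
        __m256d hi = _mm512_extractf64x4_pd(v, 1);  // vextractf64x4 ymm, zmm, 1
        return _mm256_castpd256_pd128(hi);          // low half, free
    }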

    If tuning specifically for Zen 4 without caring about Intel, vextractf64x4 + vinsertf64x4 is lower total latency than vshuff64x2 zmm, although it costs 2 uops instead of 1 for the front-end. Other than insert/extract of 256-bit halves, 512-bit shuffles on Zen 4 occupy their execution unit for 2 cycles (actual throughput of 1/clock is half what you'd expect from being one uop for either of two ports, like how Zen 4 handles other 512-bit uops that don't need to move data between halves).
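
    A sketch of that Zen 4-friendly half swap (2 front-end uops; both casts are free, assuming the compiler doesn't materialize them):

    // Swap the 256-bit halves of a __m512d via vextractf64x4 + vinsertf64x4
    // instead of one vshuff64x2: lower total latency on Zen 4.
    static inline __m512d swap_halves_extract_insert(__m512d v) {
        __m256d hi = _mm512_extractf64x4_pd(v, 1);  // vextractf64x4: grab the high half
        __m512d r  = _mm512_castpd256_pd512(hi);    // hi becomes the new low half (free)
        return _mm512_insertf64x4(r, _mm512_castpd512_pd256(v), 1);  // vinsertf64x4: old low half on top
    }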

    For the middle shuffle, swapping 128-bit halves, the choice is between vshuff64x2 z,z,z,imm8 and vpermpd z,z,imm8. Both run identically on current CPUs including Zen 4. We might choose vshuff64x2 based on the wider granularity (moving data around in 128-bit chunks instead of 64-bit), but there's another factor to consider: vpermpd z,z,imm8 does independent 256-bit shuffles in each half, so decomposes trivially for 256-bit execution units. (Unlike with the vector control version that has eight 3-bit indices to pick from across the whole vector.)
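
    In intrinsics, the two candidates look like this (helper names illustrative); in this reduction both produce the same result, swapping the 128-bit chunks within each 256-bit half:

    static inline __m512d mid_vpermpd(__m512d s) {
        return _mm512_permutex_pd(s, _MM_SHUFFLE(1,0,3,2));       // vpermpd z,z,imm8
    }
    static inline __m512d mid_vshuff64x2(__m512d s) {
        return _mm512_shuffle_f64x2(s, s, _MM_SHUFFLE(2,3,0,1));  // vshuff64x2 z,z,z,imm8
    }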

    Zen 4 has shuffle execution units that are essentially 512-bit wide so they just cost some throughput and higher latency for 512-bit ops. But possible future Intel E-cores with AVX-512 might not do that, and might run vshuff64x2 z,z,z,imm8 as slowly as Zen 1 ran vperm2f128 y,y,y,imm8 (8 uops although that seems excessive) or vpermps y,y,y (3 uops). Intel E-cores in Alder Lake manage to handle those as 2 uops each, so presumably an E-core supporting AVX-512 with 256-bit execution units could also handle vshuff64x2 z,z,z,imm8 as 2 uops.

    vshuff64x2 z,z,z,imm8 takes 1024 bits of input (since it can take two different input vectors), but the first 2 output lanes are selected from the first input (so only 4 possible input bits have to route through muxes to each output bit), and same for the second two output lanes coming from the second source. So it could be decomposed to two separate 512-bit-input / 256-bit-output shuffles, like 256-bit valignq ymm,ymm,ymm, imm or like vperm2f128 ymm, ymm, ymm, imm but with each output lane being able to select any of the four. (valignq zmm is actually another possibility for the final shuffle, but less likely to be cheap.)

    So vshuff64x2 zmm is actually designed in a way that probably makes it cheaper to implement with narrower execution units than you might think, much easier than valignq or vpermt2ps or other 2-input shuffles where each output can pick from anywhere in both 512-bit inputs.
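
    For completeness, a sketch of the valignq option mentioned above (the integer-domain casts are bitwise no-ops costing no instructions):

    // Rotate the whole vector by 4 qwords, which swaps the 256-bit halves.
    static inline __m512d swap_halves_valignq(__m512d v) {
        __m512i vi = _mm512_castpd_si512(v);
        return _mm512_castsi512_pd(_mm512_alignr_epi64(vi, vi, 4));  // valignq z,z,z,4
    }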


    One might guess that a one-input shuffle _mm512_permute_pd(mvx, 0b01'01'01'01); (aka vpermilpd z,z, imm) might be more efficient on some future CPU than your vshufpd z,z,z, imm with the same input twice. That's actually true on Knights Landing (Xeon Phi), but I assume you don't care about that since it's been discontinued for a few years, and I didn't look at timings of vpermpd vs. vshuff64x2 on it.

    But on Ice Lake, the more common vshufpd y,y,y,i has 2/clock throughput vs. 1/clock for vpermilpd y,y,i (see footnote 1 below). So it's anyone's guess which shuffles will be faster on future E-cores with AVX-512, or on future big cores where there might be room for multiple 512-bit shuffle units.
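
    The two encodings of that first step, side by side for reference (helper names illustrative):

    static inline __m512d pairswap_vshufpd(__m512d v) {
        return _mm512_shuffle_pd(v, v, 0b01010101);  // vshufpd z,z,z,imm: same input twice
    }
    static inline __m512d pairswap_vpermilpd(__m512d v) {
        return _mm512_permute_pd(v, 0b01010101);     // vpermilpd z,z,imm: one input
    }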

    Summary:

    • vshufpd is fine for the first shuffle. Even if the vector started in memory, you wouldn't want a memory-source vpermilpd since you need another copy of the vector as an input for vaddpd. Could go either way on future E-cores handling one or the other more cheaply. It's an in-lane shuffle, so it decomposes trivially into multiple narrower shuffles for E-cores.

    • vpermpd-immediate is a good choice for the middle shuffle (swapping 128-bit pairs); it's likely that future E-cores can handle it efficiently (as two independent 256-bit halves). vshuff64x2 can decompose into two separate 512-bit input / 256-bit output shuffles, though, so it's not bad either.

      vpermpd with a vector control operand doesn't decompose as easily, but it's a different opcode so hopefully the immediate control version would still be cheap even if the vector control version is slower. And somehow Alder Lake E-cores do manage to run vpermps ymm as 2 uops.

    • vshuff64x2 or valignq are equally good for swapping 256-bit halves on Intel CPUs, and equal to each other on Zen 4. vshuff64x2 is clearly easier for E-cores to implement efficiently: both have the same amount of input (1024 bits), but vshuff64x2 has significantly fewer possible sources for any given bit of output (4 vs. 16, and with more restrictions on which source feeds which output if the two sources aren't the same register). Also, it's probably a more commonly-used shuffle so architects are more likely to spend transistors to make it not too slow.

      vextractf64x4 + vinsertf64x4 would be lower latency on Zen 4, which might or might not matter depending on surrounding code. But vshuff64x2 zmm is still single-uop on Zen 4 with only 4-cycle latency, like other 512-bit lane-crossing shuffles. Hypothetical smaller cores with AVX-512 might run it as 2 or more.
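
    Putting the summary together, a quick usage sketch of the hypothetical hsum_broadcast_pd helper from the earlier sketch (illustrative test, not from the original post):

    #include <stdio.h>
    #include <immintrin.h>

    int main(void) {
        double in[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // sums to 36
        double out[8];
        __m512d s = hsum_broadcast_pd(_mm512_loadu_pd(in));  // defined in the sketch above
        _mm512_storeu_pd(out, s);
        for (int i = 0; i < 8; i++)
            printf("%g ", out[i]);  // expected: "36 " printed 8 times
        return 0;
    }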


    Footnote 1: IDK why Ice Lake / Alder Lake can't just decode vpermilpd with a register source and immediate control into a vshufpd uop that reads the same input twice, since the same immediate bits produce the same shuffle in that case. It seems like a missed optimization, although maybe it would have a cost somewhere in the decoders to produce a uop with 1 input for the memory-source version vs. 2 inputs for a register-source version. The alternative would be to change the shuffle execution unit to replicate one input in that case, as a way to have port 1 handle vpermilpd uops without treating memory sources specially, at the cost of the port-1 side of the shuffle unit having to handle more different control inputs.

    On Ice Lake / Alder Lake, the port 1 execution unit that can handle some but not all 128-bit and 256-bit shuffles when there are no 512-bit uops in flight is probably just half of the 512-bit shuffle execution unit that's normally accessible from port 5. (Same way they handle 256-bit FP math instructions on port 0 or 1, but have it work as a single 512-bit FMA unit when port 1 is shut down.) So the lanes of the shuffle unit can handle vpermilpd when it's the upper half of a vpermilpd zmm, zmm, imm8 on port 5, so it seems like it would require minimal extra logic to be able to do the same when accessed via port 1. (vpermilpd zmm and vshufpd zmm use the upper 4 bits of their immediates the same way as each other, and the same as the low 4 bits works for the low half. Each 128-bit lane has 2 bits of control input.)

    I wonder if it's intentional to make sure vpermilpd/ps can't steal cycles from FP math ops (port 0 and 1 for 256-bit). That could make sense, and is maybe even useful for people tuning a loop that bottlenecks on p01 throughput vs. shuffle throughput: they can use vshufpd y, same,same, i to let it run on port 1 or 5, or just for smaller machine-code size (2-byte VEX). Or vpermilpd y, ymm/mem, i to restrict it to port 5, at the cost of an extra byte of machine-code size if vshufpd didn't already need a 3-byte VEX. (Or a whole separate instruction if it was shuffling a memory source. But like many instructions with an immediate operand, Intel CPUs can't micro-fuse the load+ALU uop, so the cost in issue bandwidth is the same.)

    That seems unlikely. Maybe they just analyzed existing code and found shufpd / vshufpd was more common and thus more important; unsurprising, since shufpd is SSE2 but vpermilpd didn't exist until AVX1. That factor may be what shaped this design, which matters when choosing YMM shuffles even though both vshufpd ymm and vpermilpd were new with AVX1.

    But guessing about the future: Intel's Gracemont E-cores in Alder Lake have identical performance for vpermilpd ymm, ymm, i8 vs. vshufpd ymm, ymm, ymm, i8.