parallel-processing cpu-architecture simd sse intrinsics

does the instruction sqrtpd calculate the sqrt at the same time?

I'm learning SIMD intrinsics and parallel computing. I am not sure if Intel's definition for the x86 instruction sqrtpd says that the square root of the two numbers that are passed to it will be calculated at the same time:

Performs a SIMD computation of the square roots of the two, four, or eight packed double-precision floating-point values in the source operand (the second operand) and stores the packed double-precision floating-point results in the destination operand (the first operand).

I understand that it explicitly says SIMD computation but does this imply that for this operation the root will be calculated simultaneously for both numbers?

Solution

For sqrtpd xmm, yes: modern CPUs compute both elements truly in parallel, rather than running them through a narrower execution unit one at a time. Older (especially low-power) CPUs did run them one at a time. For AVX vsqrtpd ymm, some CPUs do perform it in two halves.

But if you're just comparing performance numbers against narrower operations, note that some CPUs like Skylake can use different halves of their wide div/sqrt unit for separate sqrtpd/sd xmm, so those have twice the throughput of YMM, even though it can do a full vsqrtpd ymm in parallel.

The same goes for AVX-512 vsqrtpd zmm: even Ice Lake splits it up into two halves, as we can see from it being 3 uops (2 for port 0, where Intel puts the div/sqrt unit, and 1 that can run on other ports).

Being 3 uops is the key tell-tale for a sqrt instruction being wider than the execution unit on Intel, but you can also look at the throughput of YMM vs. XMM vs. scalar XMM to see how it's able to feed narrower operations to different pipes of a wide execution unit independently.

The only difference is performance; the destination x/y/zmm register definitely has the square roots of each input element. Check performance numbers (and uop counts) on https://uops.info/ (currently down but normally very good), and/or https://agner.org/optimize/.
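
To make the "only difference is performance" point concrete, here is a minimal sketch in C using the SSE2 intrinsic _mm_sqrt_pd, which compilers normally turn into sqrtpd (or vsqrtpd when AVX is enabled). The helper name sqrt2 is made up for this illustration. Whatever the CPU does internally, both output elements hold correctly rounded square roots.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_loadu_pd, _mm_sqrt_pd, _mm_storeu_pd */

/* Hypothetical helper: square roots of two doubles with a single packed sqrt. */
void sqrt2(const double in[2], double out[2])
{
    __m128d v = _mm_loadu_pd(in);  /* load two packed doubles (no alignment required) */
    __m128d r = _mm_sqrt_pd(v);    /* one packed square root over both elements */
    _mm_storeu_pd(out, r);         /* store both results */
}
```

Calling sqrt2 with {4.0, 9.0} gives {2.0, 3.0} on any x86-64 CPU; whether the two roots were computed simultaneously or one after the other is not visible in the result, only in timing.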

It's allowed but not guaranteed that CPUs internally have wide execution units, as wide as the widest vectors they support, and thus truly compute all results in parallel pipes.

Full-width execution units are common for instructions other than divide and square root, although AMD from Bulldozer through Zen 1 (i.e. before Zen 2) supported AVX/AVX2 with only 128-bit execution units, so vaddps ymm decoded to 2 uops, doing each half separately. Intel Alder Lake E-cores work the same way.

Some ancient and/or low-power CPUs (like Pentium-M and K8, and Bobcat) have had only 64-bit wide execution units, running SSE instructions in two halves (for all instructions, not just "hard" ones like div/sqrt).

So far only Intel has supported AVX-512 on any CPUs, and (other than div/sqrt) they've all had full-width execution units. And unfortunately they haven't come up with a way to expose the powerful new capabilities like masking and better shuffles for 128 and 256-bit vectors on CPUs without the full AVX-512. There's some really nice stuff in AVX-512 totally separate from wider vectors.

The SIMD div / sqrt unit is often narrower than others

Divide and square root are inherently slow; it's not really possible to make them low latency. They're also expensive to pipeline; no current CPUs can start a new operation every clock cycle. But recent CPUs have been pipelining at least part of the operation: I think they normally end with a couple steps of Newton-Raphson refinement, and that part can be pipelined as it only involves multiply/add/FMA type of operations.
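
As a software analogy for that refinement step (not a description of what any particular divider hardware does), here is a sketch of Newton-Raphson iterations for 1/sqrt(x) in C. It starts from the roughly 12-bit estimate of _mm_rsqrt_ps and uses only multiplies and subtracts, which is why this kind of step pipelines well. The helper name approx_sqrt4 and the choice of two iterations are assumptions for the example.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_rsqrt_ps, _mm_mul_ps, _mm_sub_ps */

/* Hypothetical helper: approximate sqrt of four positive, finite floats.
 * An input of 0 produces NaN here (0 * inf), so real code would special-case it. */
void approx_sqrt4(const float in[4], float out[4])
{
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);

    __m128 x = _mm_loadu_ps(in);
    __m128 y = _mm_rsqrt_ps(x);          /* rough estimate of 1/sqrt(x) */

    for (int i = 0; i < 2; i++) {
        /* Newton-Raphson step for 1/sqrt(x): y = y * (1.5 - 0.5 * x * y * y) */
        __m128 xyy = _mm_mul_ps(x, _mm_mul_ps(y, y));
        y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
    }

    _mm_storeu_ps(out, _mm_mul_ps(x, y)); /* sqrt(x) = x * (1/sqrt(x)) */
}
```

Each iteration roughly doubles the number of correct bits, so two steps take the estimate close to single-precision accuracy, though the result is not guaranteed to be correctly rounded the way sqrtps is.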

Intel has supported AVX since Sandybridge, but it wasn't until Skylake that they widened the FP div/sqrt unit to 256-bit.

For example, Haswell runs vsqrtpd ymm as 3 uops, 2 for port 0 (where the div/sqrt unit is) and one for any port, presumably to recombine the results. The latency is just about a factor of 2 longer, and throughput is half. (A uop reading the result needs to wait for both halves to be ready.)

Agner Fog may have tested latency with vsqrtpd ymm reading its own result; IDK if Intel can let one half of the operation start before the other half is ready, or if the merging uop (or whatever it is) would end up forcing it to wait for both halves to be ready before starting either half of another div or sqrt. Instructions other than div/sqrt have full-width execution units and would always need to wait for both halves.
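
The latency vs. throughput distinction can be sketched with two loop shapes in C. In the first, each square root reads the previous result, so the loop is bound by latency; in the second, the iterations are independent, so out-of-order execution can overlap them and the loop is bound by throughput. The function names are made up for this illustration, and the timing harness (e.g. a cycle counter around many iterations) is left out; this is only the shape of such a test, not Agner Fog's or uops.info's actual method.

```c
#include <emmintrin.h>  /* SSE2: __m128d, _mm_sqrt_pd, _mm_loadu_pd, _mm_storeu_pd */
#include <stddef.h>     /* size_t */

/* Latency-bound: every sqrtpd depends on the result of the previous one. */
__m128d sqrt_chain(__m128d x, int n)
{
    for (int i = 0; i < n; i++)
        x = _mm_sqrt_pd(x);
    return x;
}

/* Throughput-bound: each sqrtpd works on separate data, so nothing
 * forces one to wait for another. n is assumed to be a multiple of 2. */
void sqrt_array(double *a, size_t n)
{
    for (size_t i = 0; i < n; i += 2)
        _mm_storeu_pd(a + i, _mm_sqrt_pd(_mm_loadu_pd(a + i)));
}
```

Dividing the measured cycles by the iteration count would approximate the latency for sqrt_chain and the reciprocal throughput for sqrt_array, as long as the array stays in L1 cache so memory doesn't become the bottleneck.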

I also collected divps / pd / sd / ss throughputs and latencies for YMM and XMM on various CPUs in a table on "Floating point division vs floating point multiplication".