> I have only been able to find 128-bit versions of the Vector Dot Product AVX/SIMD instructions
`dpps` is usually only helpful in the first place if you're using SIMD "wrong" (inefficiently): for horizontal operations within vectors rather than vertical operations across multiple vectors. (Sometimes you do that anyway, if you can get some speedup without re-engineering a data structure that a lot of existing code uses, or if other use-cases for the data are more important.)
The slides + text from SIMD at Insomniac Games (GDC 2015) have some good material about designing your data structures to be SIMD-friendly (SoA vs. AoS), and how that can give much better speedups than abusing a SIMD vector to hold an x,y,z geometry vector.
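To make the SoA point concrete, here's a minimal sketch (the function name is mine, and it assumes AVX2 + FMA and that `n` is a multiple of 8): with separate x/y/z arrays, eight 3D vector lengths come out per iteration using purely vertical math, with no shuffles or horizontal sums at all.

```c
#include <immintrin.h>
#include <stddef.h>

// SoA layout: separate x, y, z arrays. Eight lengths per iteration,
// all vertical operations, no in-vector (horizontal) work needed.
void lengths_soa(const float *x, const float *y, const float *z,
                 float *out, size_t n)          // n assumed multiple of 8
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        __m256 vz = _mm256_loadu_ps(z + i);
        __m256 sq = _mm256_fmadd_ps(vx, vx,
                      _mm256_fmadd_ps(vy, vy, _mm256_mul_ps(vz, vz)));
        _mm256_storeu_ps(out + i, _mm256_sqrt_ps(sq));  // |v| = sqrt(x²+y²+z²)
    }
}
```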
> I have only been able to find 128-bit versions of the Vector Dot Product AVX/SIMD instructions
AVX1 includes `vdpps ymm, ymm, ymm/m256, imm8`. It does two independent 128-bit DPPS operations in the two lanes of a YMM, like most other horizontal / shuffle instructions that got widened in less-than-useful ways in AVX and especially AVX2. Perhaps that's what you meant?
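For illustration, a minimal sketch with the corresponding intrinsic, `_mm256_dp_ps`; the imm8 value 0xF1 is just one example choice, multiplying all 4 elements per lane and writing each sum to element 0 of its lane:

```c
#include <immintrin.h>

// vdpps ymm semantics: two independent 128-bit dot products, one per lane.
// imm8 = 0xF1: input mask 0xF (use all 4 elements of each lane),
// output mask 0x1 (write the sum to element 0 of each lane, zero the rest).
__m256 two_dot_products(__m256 a, __m256 b) {
    return _mm256_dp_ps(a, b, 0xF1);
    // result element 0 = dot(a[0..3], b[0..3])
    // result element 4 = dot(a[4..7], b[4..7])
}
```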
> It seems like a reasonably important instruction family
Not at all. It's only useful for dot products of 2 to 4 elements, and it decodes to multiple uops anyway. It can be worth using on Intel if it does exactly what you need, but AMD just microcodes it (https://uops.info/ / https://agner.org/optimize/). Or at least that was true before Ice Lake, which also mostly "gave up" on it: `vdpps xmm` decodes as 6 uops (14c latency) there, up from 4 uops (13c latency) in Skylake. Still not as bad as Zen's 8 uops (15c latency).
Geometry vectors of 3 elements in a 4-element SIMD vector get (ab)used some; that's what `dpps` is for. (And possibly sometimes stuff like 3x3 or 4x4 matrices, maybe even 2x2.) Two separate in-lane DPPS operations for `vdpps ymm` are probably more useful for getting some SIMD benefit out of Array-of-Structs (AoS) data formats than a single 8-wide operation would be.
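A hypothetical sketch of that AoS use (the function name and the assumptions that `n` is even and the data is xyzw-interleaved, 4 floats per logical vector, are mine): two 3-element dot products per iteration, one in each lane.

```c
#include <immintrin.h>
#include <stddef.h>

// AoS layout: each logical 3D vector is 4 contiguous floats (x, y, z, pad).
// vdpps ymm computes two 3-element dot products per iteration, one per lane.
void dot3_aos(const float *a, const float *b, float *out, size_t n)  // n even
{
    for (size_t i = 0; i < n; i += 2) {
        __m256 va = _mm256_loadu_ps(a + 4 * i);  // two xyzw structs from a
        __m256 vb = _mm256_loadu_ps(b + 4 * i);  // two xyzw structs from b
        __m256 dp = _mm256_dp_ps(va, vb, 0x71);  // per lane: x,y,z products summed into element 0
        out[i]     = _mm256_cvtss_f32(dp);                             // low lane result
        out[i + 1] = _mm_cvtss_f32(_mm256_extractf128_ps(dp, 1));      // high lane result
    }
}
```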
`dpps` isn't useful for dot products of larger arrays. For arrays, FMA into multiple accumulators which you only combine horizontally at the end, like in Improving performance of floating-point dot-product of an array with SIMD. See also this Q&A for some perf-tuning experiments.
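A minimal sketch of that pattern, assuming AVX2 + FMA and, for brevity, that `n` is a multiple of 16 (a real version needs a cleanup loop, and more accumulators help hide FMA latency on some CPUs):

```c
#include <immintrin.h>
#include <stddef.h>

// Dot product of two float arrays: FMA into two independent accumulators
// inside the loop; horizontal work happens only once, at the very end.
float dot_array(const float *a, const float *b, size_t n)  // n multiple of 16
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                               _mm256_loadu_ps(b + i),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                               _mm256_loadu_ps(b + i + 8), acc1);
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);            // combine accumulators
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 s   = _mm_add_ps(lo, hi);                   // 8 -> 4
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));            // 4 -> 2
    s = _mm_add_ss(s, _mm_movehdup_ps(s));             // 2 -> 1
    return _mm_cvtss_f32(s);
}
```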
A lane-crossing `vdpps` wouldn't be worth building any more dedicated hardware in execution units for, vs. just letting software handle arrays of 5 or more elements, or micro-coding it. AVX-512 masking makes the immediate control operand of `dpps` less valuable. (The immediate lets you ignore some input elements, and zero vs. broadcast the result to your choice of elements, with two 4-bit bitmasks.)
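To make the immediate concrete, a small sketch: `_mm_dp_ps` with imm8 = 0x7F does a 3-element dot product (input mask 0b0111 ignores the w element) and broadcasts the sum to all 4 elements (output mask 0b1111). The second function is my hypothetical AVX-512VL replacement for the input half of the immediate, using a zero-masking multiply instead:

```c
#include <immintrin.h>

// SSE4.1: input mask 0x7 (elements 0..2), output mask 0xF (broadcast the sum)
__m128 dot3_dpps(__m128 a, __m128 b) {
    return _mm_dp_ps(a, b, 0x7F);
}

// Hypothetical AVX-512VL equivalent of the input mask: a zero-masked
// multiply (k = 0b0111 zeroes the w product), then a plain horizontal sum.
float dot3_masked(__m128 a, __m128 b) {
    __m128 m = _mm_maskz_mul_ps(0x7, a, b);
    m = _mm_add_ps(m, _mm_movehl_ps(m, m));   // [0]+[2], [1]+[3] (the latter is 0)
    m = _mm_add_ss(m, _mm_movehdup_ps(m));    // + element 1
    return _mm_cvtss_f32(m);
}
```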
That raises another point: `vdpps` already uses all the bits in its immediate (the YMM version uses the same immediate control for both halves), so there wouldn't be room to scale it up to an 8-wide operation without changing how the immediate works. (E.g. maybe always broadcast, dropping the output 0-masking, so you'd have an input control-mask bit for each of the 8 floats in a YMM? But that would have required custom hardware separate from what's needed for the xmm version.)
Note that without any special instructions / uops, you can do the same thing as `vdpps` (ignoring the functionality of the immediate) with `vmulps` and then a horizontal sum (`vshufps` / `vaddps` / `vshufps` / `vaddps`, leaving the result broadcast to each element). Intel's hardware support brings this down from 5 uops to 4 (including handling the masking) on Skylake: 3 p01 (the FP math execution units) plus 1 p5 (the shuffle unit).
(Unfortunately `haddps` can't take advantage of any of that horizontal-FP hardware. :/)
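For reference, a sketch of that manual replacement (ignoring the immediate's masking), using the shuffle/add pattern so the sum ends up broadcast to every element, like `dpps` with an output mask of 0xF:

```c
#include <immintrin.h>

// vmulps + two shuffle/add rounds: dot product of all 4 elements,
// with the result broadcast to each element. 5 single-uop instructions.
__m128 dp4_broadcast(__m128 a, __m128 b) {
    __m128 m  = _mm_mul_ps(a, b);                                             // vmulps
    __m128 s1 = _mm_add_ps(m,  _mm_shuffle_ps(m,  m,  _MM_SHUFFLE(2,3,0,1))); // swap within pairs
    __m128 s2 = _mm_add_ps(s1, _mm_shuffle_ps(s1, s1, _MM_SHUFFLE(1,0,3,2))); // swap the pairs
    return s2;  // every element = m0 + m1 + m2 + m3
}
```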
Ice Lake must have dropped some of this dedicated hardware, since `dpps` is so rarely used; it's 6 uops there. (The 2 extra uops are p0/p6 and p1/p5, so maybe a shuffle and maybe something turning the immediate into a mask? Port 6 doesn't have any vector-ALU execution units, even in Ice Lake, AFAIK.)
Replicating the special-purpose hardware that `dpps` uses at an even wider vector width for AVX-512 would probably not have been worth the cost in transistors. 512 bits is very wide for a CPU; the FMA unit(s) already take significant area, and AVX-512 introduces a bunch of new instructions, including more lane-crossing shuffles, that are useful more of the time than a wider `vdpps zmm` would be on average across most code.
Usually `dpps` is only helpful in the first place if you're using SIMD wrong.