cpu-architecture simd sse cpu-registers avx

Why does SIMD have single data instructions when it's called SIMD?

I've been wondering.. It's called SIMD as in single instruction multiple data. So why does it have single data instructions?

For example, vaddss is the single data equivalent of the multiple data vaddps. Just about every SIMD instruction have a single data version.

Why?

Why does SIMD have single data instructions when it's called SIMD?

Solution

It isn't a SIMD instruction in that sense

vaddss is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.

People who call that a SIMD instruction are talking about one of:

Which registers it operates on. (XMM0 is 16 bytes wide and clearly a SIMD register, even when you only care about the low element holding a scalar value.)
The fact that it's an AVX instruction, so it was introduced with an ISA extension that was primarily aimed at SIMD usage, and thus is called a SIMD extension or instruction set.
Which also means it uses the MXCSR for rounding mode and FP exception recording / unmasking, and the kinds of exceptions it can take are the same as other SSE/AVX instructions which Intel documents as "SIMD Floating-Point Exceptions" as concise terminology to distinguish it from legacy x87.
Or they're talking about the use-case of doing something to just the low element when the high elements have actual data. (Quite rare, but something you could do. Maybe more likely with sd scalar double, where the low double is one half of an XMM register.)

Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss and sd scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx for scalar integer math, and doesn't have scalar versions of paddb or even xorps.

One reason for having separate scalar FP math instructions is that using addps would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h), but if unmasked would trap to the OS.)

With the upper elements all 0.0 (which isn't required by the calling convention, BTW), addps wouldn't raise any extra exceptions, but divps would divide by zero.

With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (factor of ~100) as the CPU takes a microcode assist to get handle subnormal input or output in many cases (or when SSE1 was new in Pentium III, probably all cases of subnormals). Unless you set FTZ and DAZ (flush to zero, denormal are zero) like gcc -ffast-math does.

For instructions like xorps or paddq which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.

MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1 is a SISD instruction, but SSE2 paddq xmm0, xmm1 is a SIMD instruction.

SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps decoded to 2 uops; addss decoded to 1. So there was a performance motivation, too, even in the best case.

This is also likely the reason for Intel's unfortunate design where sqrtss and cvtsi2ss and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? . It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround, see my answer on the linked duplicate.

It's normal for scalar FP to share registers with SIMD

It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.

It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; Some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15 16-byte registers map to pairs of d0..d31 double-precision FP registers that existed with VFPv3.

(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)

In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin() before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).