I've been wondering.. It's called SIMD as in single instruction multiple data. So why does it have single data instructions?
For example, vaddss
is the single data equivalent of the multiple data vaddps
. Just about every SIMD instruction have a single data version.
Why?
Why does SIMD have single data instructions when it's called SIMD?
vaddss
is a scalar FP math instruction that operates on data in the FP/SIMD registers (XMM0..15). It exists because x87 is not a very convenient compiler target with its stack-based registers that often need fxch
, and other quirks. Intel added a new way to do scalar FP math along with SSE1 (float) and SSE2 (double), which is fortunately baseline for x86-64 so everyone can just use it.
People who call that a SIMD instruction are talking about one of:
sd
scalar double, where the low double is one half of an XMM register.)Or they're just plain wrong if they actually mean it in terms of Flynn's taxonomy of SISD vs. SIMD vs. MIMD etc. I highly doubt anyone would actually mean that, though. The ss
and sd
scalar FP math instructions are SISD, single-instruction single-data. And BTW, they only exist for FP math; x86 already has instructions like add eax, ecx
for scalar integer math, and doesn't have scalar versions of paddb
or even xorps
.
One reason for having separate scalar FP math instructions is that using addps
would also operate on whatever garbage might be in the high elements of XMM registers. This can raise extra FP exceptions (usually masked, so only recorded in MXCSR (fenv.h
), but if unmasked would trap to the OS.)
With the upper elements all 0.0
(which isn't required by the calling convention, BTW), addps
wouldn't raise any extra exceptions, but divps
would divide by zero.
With non-zero garbage like small integers, it might be a bit-pattern for a subnormal float, or a result might be subnormal, causing huge slowdowns (factor of ~100) as the CPU takes a microcode assist to get handle subnormal input or output in many cases (or when SSE1 was new in Pentium III, probably all cases of subnormals). Unless you set FTZ and DAZ (flush to zero, denormal are zero) like gcc -ffast-math
does.
For instructions like xorps
or paddq
which don't do actual FP math, no FP exceptions or microcode assists are possible. You can just use them even if you only care about the low 32 or 64 bits of an XMM.
MMX or SSE2 had occasional uses in 32-bit code for doing scalar 64-bit integer math, with zeros or garbage in the upper bytes. MMX paddq mm0, mm1
is a SISD instruction, but SSE2 paddq xmm0, xmm1
is a SIMD instruction.
SSE1 was new in Pentium 3, where the SIMD execution units and registers were only 64 bits wide. addps
decoded to 2 uops; addss
decoded to 1. So there was a performance motivation, too, even in the best case.
This is also likely the reason for Intel's unfortunate design where sqrtss
and cvtsi2ss
and others merge into the destination, requiring either spending extra front-end bandwidth on xor-zeroing, or risking false dependencies: Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? . It's a short-sighted design decision to make them single-uop on Pentium 3, which they unfortunately followed in SSE2 for double
precision, and stuck to for AVX and AVX-512 when they had a chance to introduce better versions with different semantics. At least the AVX versions take a 2nd source register to merge with, so you can pick a "cold" reg as a workaround, see my answer on the linked duplicate.
It isn't necessary or useful to have yet another set of registers for scalar FP, and sharing with the x87 FPU or the general-purpose integer registers would each be worse for separate reasons.
It's totally normal on other ISAs for the SIMD registers to overlap or be the same as the scalar FP registers; Some ISAs (like ARM) that didn't have weirdo designs like x87 didn't need new architectural state to introduce SIMD. e.g. ARM's NEON q0..q15
16-byte registers map to pairs of d0..d31
double-precision FP registers that existed with VFPv3.
(I'm not sure if the partial-register aliasing was actually common in SIMD extensions for other ISAs, though. Probably some introduced new architectural state, or just used FP double-precision registers as 64-bit integer SIMD instead of 128-bit.)
In an OS kernel you often talk about saving "FPU state" on context switch (as opposed to just the general-purpose integer registers), and these days that's short-hand for FPU and SIMD state. e.g. in the Linux kernel, you need to use kernel_fpu_begin()
before running instructions that use XMM/YMM/ZMM registers. (e.g. in the RAID5 / RAID6 drivers).