If you look at the source code of the System.Numerics.Matrix4x4
struct in .NET, multiply and other functions do an if
check to see which hardware intrinsics are supported:
if (AdvSimd.Arm64.IsSupported) {} else if (Sse.IsSupported) {}
But the generic System.Numerics.Vector<T>
struct seems to do the same thing, so what is the difference? Doesn't Vector<T>
simply look behind the scenes and use whichever is available, with a software fallback if none of them is?
C# System.Numerics Vector<T> generic SIMD doesn't expose all the shuffles and other ISA-specific things like x86 movmskps. If you can get the job done efficiently with the common subset of functionality exposed by the generic API, I'd assume that would be a good choice and would still compile to the instructions you'd expect.
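As a rough illustration of the "common subset" case, here is a minimal sketch of a purely vertical operation (dst[i] = a[i] * b[i] + c[i]) written against Vector<T>. The class and method names are made up for this example; the point is only that nothing here needs a shuffle, so the generic API should be enough.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of .NET: lane-wise multiply-add over spans.
public static class GenericSimdSketch
{
    public static void MultiplyAdd(ReadOnlySpan<float> a, ReadOnlySpan<float> b,
                                   ReadOnlySpan<float> c, Span<float> dst)
    {
        if (b.Length != a.Length || c.Length != a.Length || dst.Length != a.Length)
            throw new ArgumentException("All spans must have the same length.");

        int width = Vector<float>.Count; // lanes per vector on the current hardware
        int i = 0;

        // Vector loop: every operation is vertical (lane i only touches lane i).
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a.Slice(i, width));
            var vb = new Vector<float>(b.Slice(i, width));
            var vc = new Vector<float>(c.Slice(i, width));
            (va * vb + vc).CopyTo(dst.Slice(i, width));
        }

        // Scalar tail for the leftover elements.
        for (; i < a.Length; i++)
            dst[i] = a[i] * b[i] + c[i];
    }
}
```

Calling `GenericSimdSketch.MultiplyAdd(a, b, c, dst)` on equal-length arrays fills `dst` without any ISA-specific code; whether the multiply and add get fused into an FMA is up to the JIT.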
But the function you mentioned uses Sse.Shuffle (shufps) or AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar (?) to broadcast and mul+add. If ARM64 can actually do that in a single instruction (scalar broadcast source for a vector multiply), that's pretty cool. The predecessor to AVX-512 could do that (the KNC new instructions in early Xeon Phi), but even AVX-512 needs a shuffle and a separate FMA. (Unless the operand is coming from memory: AVX-512 can use a broadcast memory source operand.)
I don't see any shuffles at all in the docs you linked for System.Numerics, only pure vertical SIMD, so that's not very useful for a 4x4 matrix product where each row[i] needs to get multiplied by a broadcast(col[i]) vector.
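To make the broadcast-then-multiply-add pattern concrete, here is a sketch of one output row of a 4x4 product using the same kind of IsSupported dispatch as in the question. This is not the actual Matrix4x4 source; the class and method names are hypothetical, and the final branch is a plain scalar fallback I added for completeness.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: returns a[0]*b0 + a[1]*b1 + a[2]*b2 + a[3]*b3.
public static class BroadcastMulAddSketch
{
    public static Vector128<float> RowTimesMatrix(
        Vector128<float> a,
        Vector128<float> b0, Vector128<float> b1,
        Vector128<float> b2, Vector128<float> b3)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // ARM64: the lane of 'a' to broadcast is selected by the instruction itself.
            Vector128<float> r = AdvSimd.MultiplyBySelectedScalar(b0, a, 0);
            r = AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(r, b1, a, 1);
            r = AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(r, b2, a, 2);
            r = AdvSimd.Arm64.FusedMultiplyAddBySelectedScalar(r, b3, a, 3);
            return r;
        }
        else if (Sse.IsSupported)
        {
            // x86: shufps broadcasts one lane of 'a', then a separate multiply and add.
            Vector128<float> r = Sse.Multiply(Sse.Shuffle(a, a, 0x00), b0);
            r = Sse.Add(r, Sse.Multiply(Sse.Shuffle(a, a, 0x55), b1));
            r = Sse.Add(r, Sse.Multiply(Sse.Shuffle(a, a, 0xAA), b2));
            r = Sse.Add(r, Sse.Multiply(Sse.Shuffle(a, a, 0xFF), b3));
            return r;
        }
        else
        {
            // Scalar fallback when neither instruction set is available.
            float Lane(int i) =>
                a.GetElement(0) * b0.GetElement(i) + a.GetElement(1) * b1.GetElement(i) +
                a.GetElement(2) * b2.GetElement(i) + a.GetElement(3) * b3.GetElement(i);
            return Vector128.Create(Lane(0), Lane(1), Lane(2), Lane(3));
        }
    }
}
```

The broadcast step (Sse.Shuffle on x86, the selected-scalar operand on ARM64) is exactly the part that has no counterpart in a purely vertical API. Note that the fused ARM64 path and the separate multiply/add paths can round slightly differently for inputs that are not exactly representable.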
So System.Numerics looks way more crippled than GNU C native vectors in C and C++, where there is at least a __builtin_shuffle, but still missing out on special shuffles, and stuff like x86 movmskps to get a scalar bitmap of SIMD compare results. (Which ARM and ARM64 have no direct equivalent for.)
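For reference, this is roughly what the movmskps use case looks like through the x86-specific intrinsics in C#. It is only a sketch: the class and method names are hypothetical, and the non-x86 branch is a simple scalar loop rather than anything tuned for ARM.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper: bit i of the result is set when a[i] > b[i].
public static class CompareMaskSketch
{
    public static int GreaterThanMask(Vector128<float> a, Vector128<float> b)
    {
        if (Sse.IsSupported)
        {
            // cmpps produces all-ones / all-zeros lanes; movmskps packs the sign bits.
            return Sse.MoveMask(Sse.CompareGreaterThan(a, b));
        }

        // Fallback for hardware without SSE: build the bitmap one lane at a time.
        int mask = 0;
        for (int i = 0; i < 4; i++)
        {
            if (a.GetElement(i) > b.GetElement(i))
                mask |= 1 << i;
        }
        return mask;
    }
}
```

The resulting integer can be tested against zero or fed to a bit-scan to find the first matching lane, which is the usual reason to want a scalar bitmap of compare results.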