Tags: .net, bit-manipulation, simd, hardware-acceleration

System.Numerics.Vectors 'Vector<T>': is it basically just System.UInt128?


I'm looking into Vector&lt;T&gt; from the System.Numerics.Vectors package, version 4.5.0-preview1-26216-02. The documentation says:

Vector<T> is an immutable structure that represents a single vector of a specified numeric type. The count of a Vector<T> instance is fixed, but its upper limit is CPU-register dependent.
https://learn.microsoft.com/en-us/dotnet/api/system.numerics.vector-1 (emphasis added)

Even overlooking the odd wording ("count [sic] of a Vector"), this sentence seems quite unclear, since it implies that different Vector&lt;T&gt; instances might have different--albeit "fixed", up to some CPU-dependent limit--"counts" (again, a "count" of what, exactly?). There's no mention of an actual Count property here, or in fact anywhere on the intro page.

Now normally, I think "read-only" or "immutable" are more conventionally used than "fixed" for describing an instance's properties or fields, but in this case it turns out that the Vector&lt;T&gt;.Count property, while indeed read-only, is also static, and thus in no way associated with any Vector&lt;T&gt; instance. Instead, its value varies only according to the generic type argument T (and then presumably from machine to machine, as indicated):

bool hw = Vector.IsHardwareAccelerated;    // --> true

var c = (Vector<sbyte>.Count, Vector<short>.Count, Vector<int>.Count, Vector<long>.Count);
Debug.WriteLine(c);    // -->  (16, 8, 4, 2)

Oh.

So is it basically System.Int128 in disguise? And is that it? My questions are:

  • Am I missing something? It's true that I know little to nothing about SIMD, but I thought this library would allow the use of much wider hardware-accelerated datatypes than just 128 bits. My HPSG parsing engine routinely performs intensive bitwise computations on vectors of 5,000+ bits.
  • Again assuming I'm not missing the point, why not just call it System.Int128 / System.UInt128 instead of Vector&lt;T&gt;? Parameterizing it with the generic primitive types does offer certain benefits, but it gave me the wrong idea that it was more of a usefully expansive array (namely, of blittable elements T), as opposed to just a double-width CPU register, which to my mind is about as "scalar" as you can get.

    Don't get me wrong, a 128-bit register is interesting, useful, and exciting stuff--if perhaps a bit oversold here? For example, Vector&lt;byte&gt; is going to have 16 elements no matter what, regardless of whether you need or use them all, so the spirit of a Count property--which one expects to vary per instance at runtime--seems misapplied here.

  • Even though a single Vector&lt;T&gt; won't directly handle the use case I described as I'd hoped, would it be worth it to update my current implementation (which uses a ulong[N >> 6] array for each N-bit vector) to instead use a Vector&lt;ulong&gt;[N >> 7] array?

    ...yes, that's "array of Vector<ulong>", which again seems strange to me; shouldn't a type with "Vector" in its name be sufficiently or usefully extensible without having to explicitly create an array to wrap multiple instances?

  • Beyond the fact that each 128-bit SIMD bitwise operation is processing twice as much data, are SIMD bitwise operations additionally faster (or slower) in cycles per opcode as well?
  • Are there other hardware platforms in common use or availability today where System.Numerics.Vectors actually reports a different SIMD bit-width?
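For concreteness, here's a minimal sketch of the kind of scalar baseline described in the third bullet (the type and member names are my own invention; the real engine is surely more involved): an N-bit vector stored as ulong[N >> 6], combined word by word with ordinary bitwise operators.

```csharp
using System;

// Hypothetical sketch of the scalar baseline: an N-bit vector backed by
// ulong[N >> 6], AND-ed one 64-bit word at a time.
class ScalarBitVector
{
    internal readonly ulong[] Words;             // N >> 6 words for N bits

    public ScalarBitVector(int bits) => Words = new ulong[bits >> 6];

    public void AndWith(ScalarBitVector other)
    {
        for (int i = 0; i < Words.Length; i++)   // one 64-bit op per word
            Words[i] &= other.Words[i];
    }
}
```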

Solution

  • The vector size is not always 16 bytes, though that is very common. For example, on a platform with AVX2, programs run in 64-bit mode get 32-byte vectors. In this way the Count property can also vary on the same machine (for the same T), depending on which mode the program runs in. In principle it wouldn't have to be like that--a 32-bit program could still use 256-bit operations with just AVX1 support--but that's not how System.Numerics.Vectors works. The varying size per feature-level of the CPU is a fairly fundamental part of the design of the API, presumably to enable some form of future-proofing, though it perhaps contributed to the lack of shuffles (which would be hard to specify for a vector of not-statically-known size).
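    A quick way to see what a given process actually gets (a small sketch; the output depends on the CPU and on whether the process runs in 32- or 64-bit mode, so no fixed result is shown):

    ```csharp
    using System;
    using System.Numerics;

    class VectorWidthProbe
    {
        static void Main()
        {
            // Under SSE2 this reports 128 bits (Vector<byte>.Count == 16);
            // under AVX2 in 64-bit mode, 256 bits (Count == 32).
            int bits = Vector<byte>.Count * 8;
            Console.WriteLine(
                $"Accelerated: {Vector.IsHardwareAccelerated}, width: {bits} bits");
        }
    }
    ```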

    I thought this library would allow the use of much wider hardware-accelerated datatypes than just 128 bits

    That doesn't exist in the hardware so it would be hard to offer. AVX-512 goes up to 512 bits as the name implies, but that's as far as SIMD on mainstream CPUs goes for now.

    why not just call it System.Int128 / System.UInt128

    I would expect those types to map to actual integer types, not vector types. Many operations that would make sense on a 128-bit integer do not actually exist as CPU instructions, and almost all operations that do exist operate on 2 × 64 (Vector&lt;long&gt;, long[2]), 4 × 32 (Vector&lt;int&gt;, int[4]), 8 × 16 (Vector&lt;short&gt;, short[8]) or 16 × 8 (Vector&lt;byte&gt;, byte[16]) bit vectors (or double those widths on platforms that support it). Offering a "byte-wise add" operation on an Int128 would be strange, and not offering true 128-bit addition makes it even stranger. Besides, as mentioned earlier, the size is not by definition 128 bits; that's just common.
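    To illustrate the per-lane semantics (a small sketch; the lane count itself depends on the machine): adding two Vector&lt;byte&gt; values adds each byte independently, with no carry between lanes--exactly what a true UInt128 addition would not do.

    ```csharp
    using System.Numerics;

    // Per-lane arithmetic: every byte wraps around independently,
    // with no carry propagating into the neighboring lane.
    var allFF = new Vector<byte>(0xFF);   // every lane = 0xFF
    var one   = new Vector<byte>(1);      // every lane = 0x01
    var sum   = allFF + one;              // every lane wraps: 0xFF + 1 -> 0x00
    // A real 128-bit integer add would instead carry through the whole value.
    ```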

    Many SIMD operations are quite fast, though there are some exceptions; for example, 32-bit multiplication typically has rather extreme latency. The System.Numerics.Vectors API also allows some operations that don't exist in hardware (and so must be emulated slowly, such as integer division or byte multiplication) without hinting that there is a problem. Operations that map to instructions that actually exist are mostly fast.

    While bitwise operations on ulong are also fast, their vector versions are even better when viewed in terms of total work done per unit time. For example, Skylake can execute at best four scalar bitwise operations per cycle (and the extra operations needed to form a loop, such as an addition and a compare/branch, compete for those same resources), but it can execute three 256-bit bitwise SIMD operations per cycle--three times the amount of work in the same time--while still leaving an execution port open for a scalar operation or branch.

    So yes, it's probably worth using. You could keep the array of ulong and use the construct-from-array constructor of Vector&lt;T&gt;; that way you don't have to deal with vectors everywhere. For example, indexing into a vector with a variable index is not a nice operation at all, causing a branch, a vector store and a scalar reload. The variable-size nature of the vectors also significantly complicates using arrays of them directly, rather than using an array of primitive types and then vector-loading from it. You could easily round up the length of the array to a multiple of the vector count, though, to remove the need for a small scalar loop to handle the remaining items at the end of the array that don't quite fit in a vector.
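    A sketch of that suggestion (the helper name is mine; it assumes both arrays have the same length, already rounded up to a multiple of Vector&lt;ulong&gt;.Count, so no scalar tail loop is needed):

    ```csharp
    using System.Numerics;

    static class BitOps
    {
        // AND 'src' into 'dst', Vector<ulong>.Count words per iteration.
        // Both arrays are assumed equal in length, a multiple of Count.
        public static void AndInPlace(ulong[] dst, ulong[] src)
        {
            int step = Vector<ulong>.Count;          // 2 with SSE2, 4 with AVX2
            for (int i = 0; i < dst.Length; i += step)
            {
                var a = new Vector<ulong>(dst, i);   // construct-from-array load
                var b = new Vector<ulong>(src, i);
                (a & b).CopyTo(dst, i);              // vector store back
            }
        }
    }
    ```

    The storage stays a plain ulong[], so existing scalar code (indexing, popcounts, and so on) keeps working unchanged; only the hot loops load through Vector&lt;ulong&gt;.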