What is the most performant way to do arithmetic on a few generic numbers wrapped in an unmanaged, generic, and mutable struct in C# in .NET 8+? Unsafe code CAN be used.
What is the best-performing way to extract primitive numeric types from a containing struct using generics? The primitive numeric types are wrapped inside an unmanaged, generic, and mutable struct, which is available only via a generic context and whose size is assumed to be a multiple of TNum's size:
static unsafe void Add<TNum, TStruct>(TStruct s, TNum n) where TNum : unmanaged, INumber<TNum> where TStruct : unmanaged {
}
So the question is: assuming TStruct in the example above contains, say, exactly 3 TNums which happen to be shorts (so the struct's size is 6 bytes), how should the above routine be implemented in the most performant way? (We want to add a certain number to each of those 3 shorts in the struct.)
What is the most performant way to "extract" individual TNums from such generic structs:
static unsafe void Add<TNum, TStruct>(TStruct s, TNum n) where TNum : unmanaged, INumber<TNum> where TStruct : unmanaged {
ref var v0 = ref Unsafe.AsRef<TNum>(Unsafe.AsPointer(ref s));
ref var v1 = ref Unsafe.AsRef<TNum>(Unsafe.Add<TNum>(Unsafe.AsPointer(ref s), 1));
...
v0 = ...
v1 = ...
}
OR
static unsafe void Add<TNum, TStruct>(TStruct s, TNum n) where TNum : unmanaged, INumber<TNum> where TStruct : unmanaged {
var v = (TNum*)Unsafe.AsPointer(ref s);
v[0] = ...
v[1] = ...
}
OR
Something else?
What is the best-performing, likely SIMD-assisted way to perform the same arithmetic operation on a few (2-6) generic numbers?
And then, once the TNums have been "extracted", what is the most efficient way to perform the same arithmetic operation on them in a generic context (we do not know ahead of time what specific primitive type the TNums will end up being)?
So, given generic math support in .NET 8+, I am looking for the most efficient way to perform basic arithmetic on generic primitive TNums (like byte, short, int, float and a few others) when there are a few of them (i.e. more than 1, but fewer than, say, 6 - similar to the struct example above).
One could use System.Numerics.Vector and System.Numerics.Vector<T>, however the problem there is that the size of the vector must match the total size of the primitives you are working with, which is not possible in the generic scenario.
So is there some general approach that will work on modern processors regardless of the underlying primitive type to, for example, SIMD-add 3 primitive numbers, or to add a constant to all 3 of them? Unsafe code can be used.
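For reference, here is the plain scalar generic-math baseline I would like to beat (a minimal sketch; AddScalar is just an illustrative name - the point is that the + operator resolves through INumber<TNum>, so one method covers byte, short, int, float, etc.):
using System.Numerics;
// Illustrative scalar baseline: one generic method for all primitive numeric types.
static TNum AddScalar<TNum>(TNum x, TNum n) where TNum : unmanaged, INumber<TNum>
    => x + n; // operator + comes from the INumber<TNum> constraint
Console.WriteLine(AddScalar((short)5, (short)2)); // 7
Console.WriteLine(AddScalar(1.5f, 2.0f));         // 3.5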
UPDATE: Just to clarify: TStruct is a whole multiple of TNum in size and its layout is sequential.
You have a huge array of a generic, mutable struct TContainer that consists entirely of a few unmanaged primitive numbers of the same type TNumber, and you would like to find the most performant way to do generic math on the values in the structs, for instance adding a constant value to all the fields.
First, some terminology. An array of primitive numbers, or a struct containing primitive numbers of uniform type, or an array of such structs, can all be treated in a uniform manner by using spans. For instance, a value tuple of doubles can be mutated by reinterpreting a reference to it as a span of doubles using MemoryMarshal.Cast()
:
var tuple = (1.0, 1.0);
var span = MemoryMarshal.Cast<(double, double), double>(new Span<(double, double)>(ref tuple));
foreach (ref var d in span)
d += 1.0;
Console.WriteLine(tuple); // Prints (2, 2)
Since the same foreach
loop could be used to mutate a span referring to an array of doubles, spans provide a convenient (and performant) tool to address your problem.
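For instance, the same pattern mutates an array of structs in place (a minimal sketch; ShortTriple is a hypothetical 3-short struct standing in for the question's TStruct):
using System.Runtime.InteropServices;
var items = new ShortTriple[4];
// Reinterpret the whole array as one flat span of 4 * 3 = 12 shorts...
var numbers = MemoryMarshal.Cast<ShortTriple, short>(items.AsSpan());
// ...and mutate every field with the same foreach loop used for the tuple above.
foreach (ref var n in numbers)
    n += 1;
// Hypothetical stand-in for the question's 6-byte struct of 3 shorts.
struct ShortTriple { public short A, B, C; }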
When doing generic math on spans of numbers, you could try using Vector<T>
to speed your numeric operations using SIMD instructions:[1]
The Vector<T> gives the ability to use longer [SIMD] vectors. The count of a Vector<T> instance is fixed, but its value Vector<T>.Count depends on the CPU of the machine running the code.
On my computer the fixed size of Vector<T> is 32 bytes, but apparently 16 bytes is also common, see here for details. If the number of fields in your TContainer is greater than or equal to Vector<TNumber>.Count you could try to mutate it using SIMD instructions. Similarly, if you need to mutate all the structs in your huge array (or some large slice of the same) by adding some fixed value to all of them, you could also use Vector<TNumber>.
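A minimal sketch of that check; the counts shown are what I see with 32-byte (AVX2) vectors and will differ on other hardware:
using System.Numerics;
Console.WriteLine(Vector.IsHardwareAccelerated); // True on typical x64/ARM64 hardware
Console.WriteLine(Vector<double>.Count);         // 4 with 32-byte vectors
Console.WriteLine(Vector<float>.Count);          // 8 with 32-byte vectors
Console.WriteLine(Vector<short>.Count);          // 16 with 32-byte vectors
// Box3D<double> below has 6 double fields (48 bytes), so even a single box
// exceeds Vector<double>.Count elements.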
With that in mind, I created large arrays of a couple of test structs, both at least 32 bytes in size:
public struct Box3D<TNumber> where TNumber : unmanaged, INumber<TNumber>
{
public Point3D<TNumber> Min, Max;
}
public struct Point3D<TNumber> where TNumber : unmanaged, INumber<TNumber>
{
public TNumber X, Y, Z;
}
And:
ValueTuple<float, float, float, float, float, float, float, float> // 8 floats
And created several candidate methods to transform spans of them generically, varying:
- whether the TContainer is passed by reference, as a span, or by value;
- whether Vector<T> is used when possible, not used at all, or supplied as some pre-initialized vector passed by reference or by value.
The methods are:
public static partial class NumericExtensions
{
//-- Methods to add some number to some container reference(s) with automatic vectorization when possible
public static Span<TContainer> AddToContainerSpan<TContainer, TNumber>(Span<TContainer> span, TNumber n) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
Debug.Assert((Unsafe.SizeOf<TContainer>() >= Unsafe.SizeOf<TNumber>())); // We are assuming that TContainer contains one or more TNumber fields, so it makes no sense if TContainer is smaller than TNumber
AddToSpan<TNumber>(MemoryMarshal.Cast<TContainer, TNumber>(span), n);
return span;
}
public static ref TContainer AddToContainerRef<TContainer, TNumber>(ref TContainer s, TNumber n) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
Debug.Assert(Unsafe.SizeOf<TContainer>() >= Unsafe.SizeOf<TNumber>());
AddToSpan(MemoryMarshal.Cast<TContainer, TNumber>(new Span<TContainer>(ref s)), n);
return ref s;
}
public static Span<TNumber> AddToSpan<TNumber>(Span<TNumber> span, TNumber n) where TNumber : unmanaged, INumber<TNumber>
{
Debug.Assert(Vector<TNumber>.IsSupported); // Not checked in release builds for performance
if (Vector<TNumber>.Count is var vectorLength
// Here you could check for length being >= some multiple of vectorLength if you determine a point where the benefits of vectorization outweigh the setup costs
&& span.Length >= vectorLength)
{
int remaining = span.Length % vectorLength;
var offset = new Vector<TNumber>(n);
for (int i = 0; i < span.Length - remaining; i += vectorLength)
{
var slice = span.Slice(i, vectorLength);
var v1 = new Vector<TNumber>(slice);
(v1 + offset).CopyTo(slice);
}
for (int i = span.Length - remaining; i < span.Length; i++)
span[i] += n;
}
else
{
for (int i = 0; i < span.Length; i++)
span[i] += n;
}
return span;
}
//-- Methods to add some number to some container reference(s) with no vectorization
public static Span<TContainer> AddToContainerSpanNoVectorization<TContainer, TNumber>(Span<TContainer> span, TNumber n) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
Debug.Assert(Unsafe.SizeOf<TContainer>() >= Unsafe.SizeOf<TNumber>());
var numberSpan = MemoryMarshal.Cast<TContainer, TNumber>(span);
for (int i = 0; i < numberSpan.Length; i++) // Inlined for efficiency.
numberSpan[i] += n;
return span;
}
public static ref TContainer AddToContainerRefNoVectorization<TContainer, TNumber>(ref TContainer s, TNumber n) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
Debug.Assert(Unsafe.SizeOf<TContainer>() >= Unsafe.SizeOf<TNumber>());
var numberSpan = MemoryMarshal.Cast<TContainer, TNumber>(new Span<TContainer>(ref s));
for (int i = 0; i < numberSpan.Length; i++) // Inlined for efficiency.
numberSpan[i] += n;
return ref s;
}
public static Span<TNumber> AddToSpanNoVectorization<TNumber>(Span<TNumber> span, TNumber n) where TNumber : unmanaged, INumber<TNumber>
{
for (int i = 0; i < span.Length; i++)
span[i] += n;
return span;
}
//-- Methods to add some predefined Vector<TNumber> to some container reference(s)
public static Span<TContainer> AddVectorRefToContainerSpan<TContainer, TNumber>(Span<TContainer> span, ref readonly Vector<TNumber> offsetVector) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
Debug.Assert(Unsafe.SizeOf<TContainer>() >= Unsafe.SizeOf<TNumber>());
AddVectorRefToSpan<TNumber>(MemoryMarshal.Cast<TContainer, TNumber>(span), in offsetVector);
return span;
}
public static Span<TNumber> AddVectorRefToSpan<TNumber>(Span<TNumber> span, ref readonly Vector<TNumber> offsetVector) where TNumber : unmanaged, INumber<TNumber>
{
// Modeled on https://learn.microsoft.com/en-us/dotnet/standard/simd#vectort
// Vector<TNumber>.IsSupported should be checked by caller
Debug.Assert(Vector<TNumber>.IsSupported); // Not checked in release builds for performance
var vectorLength = Vector<TNumber>.Count;
var length = span.Length;
int remaining = length % vectorLength;
for (int i = 0; i < length - remaining; i += vectorLength)
{
var slice = span.Slice(i, vectorLength);
var v1 = new Vector<TNumber>(slice);
(v1 + offsetVector).CopyTo(slice);
}
if (remaining > 0)
{
var remainingSpan = span.Slice(length - remaining);
for (int i = 0; i < remainingSpan.Length; i++)
{
remainingSpan[i] += offsetVector[i];
}
}
return span;
}
public static Span<TContainer> AddVectorValueToContainerSpan<TContainer, TNumber>(Span<TContainer> span, Vector<TNumber> offsetVector) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
Debug.Assert(Unsafe.SizeOf<TContainer>() >= Unsafe.SizeOf<TNumber>());
AddVectorValueToSpan<TNumber>(MemoryMarshal.Cast<TContainer, TNumber>(span), offsetVector);
return span;
}
public static Span<TNumber> AddVectorValueToSpan<TNumber>(Span<TNumber> span, Vector<TNumber> offsetVector) where TNumber : unmanaged, INumber<TNumber>
{
// Modeled on https://learn.microsoft.com/en-us/dotnet/standard/simd#vectort
// Vector<TNumber>.IsSupported should be checked by caller
Debug.Assert(Vector<TNumber>.IsSupported); // Not checked in release builds for performance
var vectorLength = Vector<TNumber>.Count;
var length = span.Length;
int remaining = length % vectorLength;
for (int i = 0; i < length - remaining; i += vectorLength)
{
var slice = span.Slice(i, vectorLength);
var v1 = new Vector<TNumber>(slice);
(v1 + offsetVector).CopyTo(slice);
}
if (remaining > 0)
{
var remainingSpan = span.Slice(length - remaining);
for (int i = 0; i < remainingSpan.Length; i++)
{
remainingSpan[i] += offsetVector[i];
}
}
return span;
}
#if !NOUNSAFE
//-- Method to add some number to some container(s) using unsafe pointer arithmetic
public static unsafe ref TContainer AddToContainerUnsafe<TContainer, TNumber>(ref TContainer s, TNumber n) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
var span = MemoryMarshal.Cast<TContainer, TNumber>(new Span<TContainer>(ref s));
fixed (TNumber *pNumber = span)
{
for (int i = 0, count = span.Length; i < count ; i++)
{
pNumber[i] += n;
}
}
return ref s;
}
#endif
//-- Method to add some number to some container(s) BY COPY with no vectorization
public static TContainer AddToContainerValueNoVectorization<TContainer, TNumber>(TContainer s, TNumber n) where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
// Generic struct + math
var span = MemoryMarshal.Cast<TContainer, TNumber>(new Span<TContainer>(ref s));
for (int i = 0; i < span.Length; i++)
span[i] += n;
return s;
}
//-- Hardcoded method to add a double to a Box3D<double>
public static ref Box3D<double> AddHardcoded(ref Box3D<double> s, double n)
{
// Hardcoded struct, double math
s.Min.X += n;
s.Min.Y += n;
s.Min.Z += n;
s.Max.X += n;
s.Max.Y += n;
s.Max.Z += n;
return ref s;
}
}
Then, using BenchmarkDotNet I created benchmarks for all the above using arrays of Box3D<double>
and value tuples of 8 floats, using arrays of length 1, 2, 3, 4, 5, 10 and 1000 and ran them in .NET 8:
public partial class Box3DTestClass
{
//https://benchmarkdotnet.org/articles/features/setup-and-cleanup.html
[Params(1, 2, 3, 4, 5, 10, 1000)]
public int ArrayCount;
private Box3D<double> [] array = [];
Vector<double> offsetVector;
double offset;
[GlobalSetup]
public void SetupArrays()
{
array = new Box3D<double>[ArrayCount];
Array.Fill(array, new () { Min = new() { X = 1, Y = 1, Z = 1 }, Max = new() { X = 2, Y = 2, Z = 2 }});
offset = 2.0;
offsetVector = new Vector<double>(offset);
}
[Benchmark]
public void Benchmark_AddToContainerSpan()
{
// Benchmark adding a value to the entire array at once
NumericExtensions.AddToContainerSpan(array.AsSpan(), offset);
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerSpans()
{
var span = array.AsSpan();
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerSpan(span.Slice(i, 1), offset);
}
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerRefs()
{
// Benchmark adding a value to the array item by item by reference
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerRef(ref array[i], offset);
}
}
[Benchmark]
public void Benchmark_AddToContainerSpanNoVectorization()
{
// Benchmark adding a value to the entire array at once but without vectorization
NumericExtensions.AddToContainerSpanNoVectorization(array.AsSpan(), offset);
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerSpansNoVectorization()
{
var span = array.AsSpan();
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerSpanNoVectorization(span.Slice(i, 1), offset);
}
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerRefsNoVectorization()
{
// Benchmark adding a value to the array item by item by reference
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerRefNoVectorization(ref array[i], offset);
}
}
[Benchmark]
public void Benchmark_AddVectorRefToContainerSpan()
{
// Benchmark adding a value to the entire array at once
NumericExtensions.AddVectorRefToContainerSpan(array.AsSpan(), ref offsetVector);
}
[Benchmark]
public void Benchmark_AddVectorValueToContainerSpan()
{
// Benchmark adding a value to the entire array at once
NumericExtensions.AddVectorValueToContainerSpan(array.AsSpan(), offsetVector);
}
#if !NOUNSAFE
[Benchmark]
public void Benchmark_AddToContainerUnsafe()
{
// Benchmark adding a value to the array item by item by reference using unsafe code
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerUnsafe(ref array[i], offset);
}
}
#endif
[Benchmark]
public void Benchmark_AddToContainerValueNoVectorization()
{
// Benchmark adding a value to the array item by item by copy
for (int i = 0; i < array.Length; i++)
{
array[i] = NumericExtensions.AddToContainerValueNoVectorization(array[i], offset);
}
}
[Benchmark]
public void Benchmark_AddHardcoded()
{
// Benchmark adding a value to the array item by item using a builtin hardcoded method
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddHardcoded(ref array[i], offset);
}
}
}
public partial class TupleTestClass
{
//https://benchmarkdotnet.org/articles/features/setup-and-cleanup.html
[Params(1, 2, 3, 4, 5, 10, 1000)]
public int ArrayCount;
// I chose a tuple of 8 floats because it's the same size as Vector<float>
private (float, float, float, float, float, float, float, float) [] array = [];
Vector<float> offsetVector;
float offset;
[GlobalSetup]
public void SetupArrays()
{
array = new (float, float, float, float, float, float, float, float)[ArrayCount];
Array.Fill(array, (1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f));
offset = 2.0f;
offsetVector = new Vector<float>(offset);
}
[Benchmark]
public void Benchmark_AddToContainerSpan()
{
// Benchmark adding a value to the entire array at once
NumericExtensions.AddToContainerSpan(array.AsSpan(), offset);
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerSpans()
{
var span = array.AsSpan();
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerSpan(span.Slice(i, 1), offset);
}
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerRefs()
{
// Benchmark adding a value to the array item by item by reference
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerRef(ref array[i], offset);
}
}
[Benchmark]
public void Benchmark_AddToContainerSpanNoVectorization()
{
// Benchmark adding a value to the entire array at once but without vectorization
NumericExtensions.AddToContainerSpanNoVectorization(array.AsSpan(), offset);
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerSpansNoVectorization()
{
var span = array.AsSpan();
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerSpanNoVectorization(span.Slice(i, 1), offset);
}
}
[Benchmark]
public void Benchmark_AddToSingleItemContainerRefsNoVectorization()
{
// Benchmark adding a value to the array item by item by reference
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerRefNoVectorization(ref array[i], offset);
}
}
[Benchmark]
public void Benchmark_AddVectorRefToContainerSpan()
{
// Benchmark adding a value to the entire array at once
NumericExtensions.AddVectorRefToContainerSpan(array.AsSpan(), ref offsetVector);
}
[Benchmark]
public void Benchmark_AddVectorValueToContainerSpan()
{
// Benchmark adding a value to the entire array at once
NumericExtensions.AddVectorValueToContainerSpan(array.AsSpan(), offsetVector);
}
}
I can summarize the results as follows:
Always access and pass your structs by reference rather than by value. Access them using reference variables and pass them using reference parameters, e.g.:
for (int i = 0; i < array.Length; i++)
{
ref var item = ref array[i];
// Call some method to modify `item` passing it by reference:
Modify(ref item);
}
Passing by value could slow your algorithm down by 3-6 times. You won't be able to use LINQ when using spans and reference variables, but you probably don't want to anyway in truly performance-critical code.
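For contrast, the by-value pattern benchmarked above (Benchmark_AddToContainerValueNoVectorization) copies each 48-byte Box3D<double> into the call and copies the result back out:
// Slower: every iteration copies the whole struct in and the result back out.
for (int i = 0; i < array.Length; i++)
{
    array[i] = NumericExtensions.AddToContainerValueNoVectorization(array[i], offset);
}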
The fastest way to mutate a single reference to a TContainer
is to use a hardcoded method:
// An array with just one item to mutate:
Box3D<double> [] array = [(new Box3D<double>() { Min = new() { X = 1, Y = 1, Z = 1 }, Max = new() { X = 2, Y = 2, Z = 2 } })];
double offset = 2.0;
// Add a value to the array item by item using a builtin hardcoded method
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddHardcoded(ref array[i], offset);
}
The fastest generic way is to reinterpret the reference as a span of numbers, and mutate that without vectorization:
// Add a value to the array item by item by reference
for (int i = 0; i < array.Length; i++)
{
NumericExtensions.AddToContainerRefNoVectorization(ref array[i], offset);
}
Creating an offset vector via new Vector<TNumber>(offset)
apparently has some cost that does not pay off for a single addition.
The fastest way to mutate multiple TContainer
values in a span by adding a fixed offset (if you can reorganize your algorithm to do this) depends on the number of items. Once the amount of data to mutate exceeds 2 to 3 times Vector<TNumber>.Count, vectorization becomes faster than transforming the span item by item. And once it exceeds 3 to 4 times that size, it's even faster than using the hardcoded method:
// An array of items larger than 4 * Vector<double>.Count:
Box3D<double> [] array = Enumerable.Repeat(new Box3D<double>() { Min = new() { X = 1, Y = 1, Z = 1 }, Max = new() { X = 2, Y = 2, Z = 2 } }, 4).ToArray();
double offset = 2.0;
// Add a value to the entire array at once using vectorization
NumericExtensions.AddToContainerSpan(array.AsSpan(), offset);
The precise transition points seemed to depend on the size of the structs and specific type of number. They probably also depend on processor details so you will need to test for yourself. For large spans addition using vectorization was roughly 3 times faster than item-by-item generic addition, and twice as fast as using the hardcoded method item by item.
Using unsafe pointer arithmetic was never faster than using spans, much less vectorization. (Perhaps the Jitter can optimize the code better?)
Using a pre-initialized Vector<T>
and passing it into the Add()
method, either by reference or by value, was slower than constructing it internally. (This was also a little surprising to me.)
As suggested in comments, you might create some generic worker interface where you use generic methods to transform arbitrary struct types, but substitute in efficient hardcoded methods when present, e.g.:
public interface INumericWorker<TContainer, TNumber> where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
void ModifyItem(ref TContainer item, TNumber value);
void ModifySpan(Span<TContainer> items, TNumber value);
}
public sealed class GenericAddWorker<TContainer, TNumber> : INumericWorker<TContainer, TNumber> where TNumber : unmanaged, INumber<TNumber> where TContainer : unmanaged
{
public void ModifyItem(ref TContainer item, TNumber value) => NumericExtensions.AddToContainerRefNoVectorization(ref item, value);
public void ModifySpan(Span<TContainer> items, TNumber value) => NumericExtensions.AddToContainerSpan(items, value);
}
public sealed class Box3dDoubleAddAddWorker : INumericWorker<Box3D<double>, double>
{
public void ModifyItem(ref Box3D<double> item, double value) => NumericExtensions.AddHardcoded(ref item, value);
public void ModifySpan(Span<Box3D<double>> items, double value)
{
if (items.Length < 4)
{
foreach (ref var item in items)
NumericExtensions.AddHardcoded(ref item, value);
}
else
{
NumericExtensions.AddToContainerSpan(items, value);
}
}
}
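A usage sketch of the workers above (ChooseWorker and its type-equality dispatch are my own illustration, not part of the benchmarked code): callers request a worker for their TContainer/TNumber pair and get the hardcoded specialization when one exists, otherwise the generic fallback.
using System.Numerics;
// Illustrative dispatch only: hardcoded worker for Box3D<double>, generic otherwise.
static INumericWorker<TContainer, TNumber> ChooseWorker<TContainer, TNumber>()
    where TNumber : unmanaged, INumber<TNumber>
    where TContainer : unmanaged
    => typeof(TContainer) == typeof(Box3D<double>) && typeof(TNumber) == typeof(double)
        ? (INumericWorker<TContainer, TNumber>)(object)new Box3dDoubleAddAddWorker()
        : new GenericAddWorker<TContainer, TNumber>();
var boxes = new Box3D<double>[100];
Array.Fill(boxes, new() { Min = new() { X = 1, Y = 1, Z = 1 }, Max = new() { X = 2, Y = 2, Z = 2 } });
var worker = ChooseWorker<Box3D<double>, double>();
worker.ModifyItem(ref boxes[0], 2.0); // single item: hardcoded AddHardcoded path
worker.ModifySpan(boxes, 2.0);        // whole span: vectorized AddToContainerSpan path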
Selected benchmark results are as follows:
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.203
[Host] : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
DefaultJob : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX2
| Method | ArrayCount | Mean | Error | StdDev | Median |
|------------------------------------------------------- |----------- |-------------:|------------:|------------:|-------------:|
| Benchmark_AddToContainerSpan | 1 | 5.030 ns | 0.1008 ns | 0.0894 ns | 5.066 ns |
| Benchmark_AddToSingleItemContainerRefs | 1 | 4.945 ns | 0.0157 ns | 0.0123 ns | 4.944 ns |
| Benchmark_AddToContainerSpanNoVectorization | 1 | 5.426 ns | 0.0204 ns | 0.0181 ns | 5.429 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 1 | 2.528 ns | 0.0149 ns | 0.0139 ns | 2.523 ns |
| Benchmark_AddToContainerUnsafe | 1 | 3.217 ns | 0.0228 ns | 0.0190 ns | 3.221 ns |
| Benchmark_AddHardcoded | 1 | 2.122 ns | 0.0408 ns | 0.0382 ns | 2.147 ns |
| Benchmark_AddToContainerSpan | 2 | 4.881 ns | 0.1224 ns | 0.1310 ns | 4.954 ns |
| Benchmark_AddToSingleItemContainerRefs | 2 | 10.677 ns | 0.0259 ns | 0.0242 ns | 10.680 ns |
| Benchmark_AddToContainerSpanNoVectorization | 2 | 7.294 ns | 0.0529 ns | 0.0442 ns | 7.277 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 2 | 5.500 ns | 0.0186 ns | 0.0174 ns | 5.495 ns |
| Benchmark_AddToContainerUnsafe | 2 | 6.719 ns | 0.0255 ns | 0.0239 ns | 6.721 ns |
| Benchmark_AddHardcoded | 2 | 4.550 ns | 0.0236 ns | 0.0221 ns | 4.542 ns |
| Benchmark_AddToContainerSpan | 3 | 6.683 ns | 0.0952 ns | 0.0890 ns | 6.696 ns |
| Benchmark_AddToSingleItemContainerRefs | 3 | 14.438 ns | 0.0295 ns | 0.0261 ns | 14.434 ns |
| Benchmark_AddToContainerSpanNoVectorization | 3 | 10.221 ns | 0.2228 ns | 0.2084 ns | 10.113 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 3 | 8.758 ns | 0.0395 ns | 0.0370 ns | 8.771 ns |
| Benchmark_AddToContainerUnsafe | 3 | 10.260 ns | 0.1129 ns | 0.1056 ns | 10.313 ns |
| Benchmark_AddHardcoded | 3 | 6.490 ns | 0.0424 ns | 0.0396 ns | 6.487 ns |
| Benchmark_AddToContainerSpan | 4 | 6.240 ns | 0.0089 ns | 0.0074 ns | 6.237 ns |
| Benchmark_AddToSingleItemContainerRefs | 4 | 19.838 ns | 0.0596 ns | 0.0466 ns | 19.839 ns |
| Benchmark_AddToContainerSpanNoVectorization | 4 | 13.145 ns | 0.2296 ns | 0.2035 ns | 13.117 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 4 | 11.729 ns | 0.1569 ns | 0.1468 ns | 11.673 ns |
| Benchmark_AddToContainerUnsafe | 4 | 13.669 ns | 0.2959 ns | 0.2768 ns | 13.743 ns |
| Benchmark_AddHardcoded | 4 | 8.255 ns | 0.0196 ns | 0.0174 ns | 8.252 ns |
| Benchmark_AddToContainerSpan | 5 | 8.353 ns | 0.1941 ns | 0.2235 ns | 8.367 ns |
| Benchmark_AddToSingleItemContainerRefs | 5 | 25.891 ns | 0.5423 ns | 0.7239 ns | 26.297 ns |
| Benchmark_AddToContainerSpanNoVectorization | 5 | 16.649 ns | 0.1018 ns | 0.0850 ns | 16.656 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 5 | 15.326 ns | 0.1264 ns | 0.1056 ns | 15.366 ns |
| Benchmark_AddToContainerUnsafe | 5 | 18.024 ns | 0.0595 ns | 0.0527 ns | 18.040 ns |
| Benchmark_AddHardcoded | 5 | 11.132 ns | 0.0524 ns | 0.0437 ns | 11.144 ns |
| Benchmark_AddToContainerSpan | 10 | 13.641 ns | 0.0261 ns | 0.0244 ns | 13.646 ns |
| Benchmark_AddToSingleItemContainerRefs | 10 | 54.563 ns | 0.1820 ns | 0.1613 ns | 54.605 ns |
| Benchmark_AddToContainerSpanNoVectorization | 10 | 32.991 ns | 0.0676 ns | 0.0632 ns | 33.003 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 10 | 32.270 ns | 0.4208 ns | 0.3731 ns | 32.224 ns |
| Benchmark_AddToContainerUnsafe | 10 | 36.502 ns | 0.0796 ns | 0.0706 ns | 36.520 ns |
| Benchmark_AddHardcoded | 10 | 23.832 ns | 0.0247 ns | 0.0206 ns | 23.832 ns |
| Benchmark_AddToContainerSpan | 1000 | 928.463 ns | 18.2194 ns | 24.3224 ns | 911.696 ns |
| Benchmark_AddToSingleItemContainerRefs | 1000 | 4,923.470 ns | 7.0736 ns | 6.2706 ns | 4,924.709 ns |
| Benchmark_AddToContainerSpanNoVectorization | 1000 | 2,987.166 ns | 20.5200 ns | 17.1351 ns | 2,989.880 ns |
| Benchmark_AddToSingleItemContainerRefsNoVectorization | 1000 | 3,134.353 ns | 45.2252 ns | 35.3089 ns | 3,131.778 ns |
| Benchmark_AddToContainerUnsafe | 1000 | 3,582.236 ns | 7.7057 ns | 6.8309 ns | 3,581.778 ns |
| Benchmark_AddHardcoded | 1000 | 2,130.239 ns | 40.3144 ns | 37.7101 ns | 2,111.256 ns |
Full results + code here: https://dotnetfiddle.net/W6IJXh
[1] There are also hardcoded SIMD-accelerated types for float values such as Vector2, however these do not meet your requirement of being generic.