Implementing C# hardware intrinsics wrapper issue

I am trying to use the power of hardware intrinsic and just for test create one function based on Avx2 instructions and compare it with my current Vector implementation with no intrinsic at all.

When I did a benchmark of 2 functions doing the same, I was impressed that intrinsics function actually 2 times slower. I investigate that and found out that calculations itself ~3.8 times faster, but when I starting to create my wrapper structure and return the result, it actually where the most time is spend.

Here is my implementations for intrinsic method:

public static Vector4FHW Subtract(Vector4FHW left, Vector4FHW right)
{
    if (Avx2.IsSupported)
    {
        var left1 = Vector128.Create(left.X, left.Y, left.Z, left.W);
        var right1 = Vector128.Create(right.X, right.Y, right.Z, right.W);
        var result = Avx2.Subtract(left1, right1);
 
        var x = result.GetElement(0);
        var y = result.GetElement(1);
        var z = result.GetElement(2);
        var w = result.GetElement(3);
 
        return new Vector4FHW(x, y, z, w);
    }

    return default;
}

And here is my naive implementation of old Vector:

public static void Subtract(ref Vector3F left, ref Vector3F right, out Vector3F result)
{
   result = new Vector3F(left.X - right.X, left.Y - right.Y, left.Z - right.Z);
}

I made benchmark with BenchmarkDotNet where I call Subtract 1 000 000 times and here is my results:

With HW support I have ~3170 us, without - 970 us

And my main question: what I am doing wrong that creating C# struct with values takes soooo long comparing to my old implementation and/or I can do some additional optimizations here?

UPDATE

My Vector4FHW and Vector3F actually the same structures. They looks like this:

[StructLayout(LayoutKind.Sequential)]
public struct Vector4FHW
{
    public float X;

    public float Y;

    public float Z;

    public float W;

    public Vector4FHW(float x, float y, float z, float w)
    {
        X = x;
        Y = y;
        Z = z;
            W = w;
    }
    //...
}

And here is my tests. They also pretty simple:

[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    for (int i = 0; i < 1000000; i++)
    {
        Vector4FHW.Subtract(new Vector4FHW(1, 20, 60, 15),new Vector4FHW(20, 48, 79, 19));
    }
}

[Benchmark]
public void SubtractBenchMark()
{
    for (int i = 0; i < 1000000; i++)
    {
        Vector4F.Subtract(new Vector4F(1, 20, 60, 15), new Vector4F(20, 48, 79, 19));
    }
}

Solution

You can make a single operation with 3+3 doubles this way

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D Substract(Vector3D left, Vector3D right)
{
    Vector256<double> v0 = Vector256.Create(left.X, left.Y, left.Z, 0);
    Vector256<double> v1 = Vector256.Create(right.X, right.Y, right.Z, 0);
    Vector256<double> result = Avx.Subtract(v0, v1);
    return new Vector3D(result.GetElement(0), result.GetElement(1), result.GetElement(2));
}

MethodImplOptions.AggressiveInlining tells compiler to embed method's code to the caller's body (if possible). No method call will be in output assembly but only calculations.

It might be faster but your test has 2 issues.

Don't check Avx2.IsSupported per operation, do it once per application life.
Don't create data in the loop, memory allocations makes the test slow and dirty.

Clean test can look like this

[Benchmark]
public void SubtractBenchMarkAccelerated()
{
    Vector3D vector1 = new Vector3D(1.5, 2.5, 3.5);
    Vector3D vector2 = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        Subtract(vector1, vector2);
    }
}

But the problem of the single operation if using only 75% of Vector256 capacity. Can it be faster by 25%? Yes, with more data.

That was only start of the story. Let's imagine that you want to calculate 4 groups of vectors at once. 4 against 4. Where the performance magic starts.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector3D[] SubstractArray(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);

    for (int i = 0; i < v0.Length; i++) // v0.Length = 3 here, not 4
    {
        r[i] = Avx.Subtract(v0[i], v1[i]);
    }

    return result;
}

MemoryMarshal.Cast doesn't copy anything, it just make Span<T> that pointing to the same memory as source array, thus it's lightning-fast. I tested.

The test can look like this.

[Benchmark]
public void SubtractBenchMarkAccelerated4()
{
    Vector3D[] array1 = new Vector3D[4];
    array1[0] = new Vector3D(1.5, 2.5, 3.5);
    array1[1] = new Vector3D(1.5, 2.5, 3.5);
    array1[2] = new Vector3D(1.5, 2.5, 3.5);
    array1[3] = new Vector3D(1.5, 2.5, 3.5);
    Vector3D[] array2 = new Vector3D[4];
    array2[0] = new Vector3D(0.1, 0.2, 0.3);
    array2[1] = new Vector3D(0.1, 0.2, 0.3);
    array2[2] = new Vector3D(0.1, 0.2, 0.3);
    array2[3] = new Vector3D(0.1, 0.2, 0.3);
    for (int i = 0; i < 1000000; i++)
    {
        SubstractArray(array1, array2);
    }
}

Calculate 4000000 of vectors in the ~same time as 1000000, why not? You can calculate any amount of vectors this way. Just be sure that doubles count % 4 == 0.

Can it be faster than above example? Yes but with unsafe code only.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector3D[] SubstractArrayUnsafe(Vector3D[] left, Vector3D[] right)
{
    var v0 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(left);
    var v1 = MemoryMarshal.Cast<Vector3D, Vector256<double>>(right);
    Vector3D[] result = new Vector3D[left.Length];
    var r = MemoryMarshal.Cast<Vector3D, Vector256<double>>(result);

    fixed (Vector256<double>* vPtr0 = v0, vPtr1 = v1, rPtr = r)
    {
        Vector256<double>* endPtr0 = vPtr0 + v0.Length;
        Vector256<double>* vPos0 = vPtr0;
        Vector256<double>* vPos1 = vPtr1;
        Vector256<double>* rPos = rPtr;
        while (vPos0 < endPtr0)
        {
            *rPos = Avx.Subtract(*vPos0, *vPos1);
            vPos0++;
            vPos1++;
            rPos++;
        }
    }
    return result;
}

You can substract this way not only Vector3D[] but your Vector4D[] or simply double[] arrays.

Also visit these useful pages: x86/x64 SIMD Instruction List (SSE to AVX512) and this one.

UPDATE

Optimizing single operation for packages of the same size

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
{
    var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
    var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
    var result = Avx.Subtract(left1, right1);
    return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
}

Let's Benchmark

class Program
{
    static void Main()
    {
        var summary = BenchmarkRunner.Run<MyBenchmark>();
        Console.ReadKey();
    }
}

[StructLayout(LayoutKind.Sequential)]
public struct Vector4DHW
{
    public double X;

    public double Y;

    public double Z;

    public double W;

    public Vector4DHW(double x, double y, double z, double w)
    {
        X = x;
        Y = y;
        Z = z;
        W = w;
    }
}

public class MyBenchmark
{
    private Vector4DHW vector1 = new Vector4DHW(1.5, 2.5, 3.5, 4.5);
    private Vector4DHW vector2 = new Vector4DHW(0.1, 0.2, 0.3, 0.4);

    [Benchmark]
    public void Loop()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var j = i;
        }
    }

    [Benchmark]
    public void Substract()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = Substract(ref vector1, ref vector2);
        }
    }

    [Benchmark]
    public void SubstractAvx()
    {
        for (int i = 0; i < 1000000; i++)
        {
            var result = SubstractAvx(ref vector1, ref vector2);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW Substract(ref Vector4DHW left, ref Vector4DHW right)
    {
        return new Vector4DHW(left.X - right.X, left.Y - right.Y, left.Z - right.Z, left.W - right.W);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector4DHW SubstractAvx(ref Vector4DHW left, ref Vector4DHW right)
    {
        var left1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref left);
        var right1 = Unsafe.As<Vector4DHW, Vector256<double>>(ref right);
        var result = Avx.Subtract(left1, right1);
        return Unsafe.As<Vector256<double>, Vector4DHW>(ref result);
    }
}

Go!

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4700HQ CPU 2.40GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.102
  [Host]     : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT
  DefaultJob : .NET Core 5.0.2 (CoreCLR 5.0.220.61120, CoreFX 5.0.220.61120), X64 RyuJIT

|       Method |       Mean |   Error |  StdDev |
|------------- |-----------:|--------:|--------:|
|         Loop |   317.6 us | 1.36 us | 1.21 us |
|    Substract | 1,427.0 us | 4.14 us | 3.46 us |
| SubstractAvx |   478.0 us | 1.58 us | 1.40 us |

As conclusion I can tell that memory optimisations are really important when you're trying to save few more microseconds in performance. And even Stack allocations are important regardless of its lightning-fast speed. Finally the for loop overhead consumes a lot of time from 478 microseconds. That's why I measured the Loop overhead separately.

Let's calculate the performance gain from AVX.

1,427.0 - 317.6 = 1109.4
478.0 - 317.6 = 160.4
1109.4 / 160.4 = 6.92

AVX is almost 7 times faster.

UPDATE2

Also test this

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public unsafe static Vector4DHW Substract(Vector4DHW left, Vector4DHW right)
{
    var result = Avx.Subtract(*(Vector256<double>*)&left, *(Vector256<double>*)&right);
    return *(Vector4DHW*)&result;
}