C# .NET SIMD: System.Numerics.Vector4 slower than a loop

I wrote the following code to experiment with System.Numerics.Vector4 and evaluate the performance gain:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Numerics;

namespace ConsoleApp8
{
    class Program
    {
        static void Main(string[] args)
        {
            const int N = 100000000;
            long ticks_start, ticks_end;

            ticks_start = DateTime.Now.Ticks;

            float[] a = { 10, 10, 10, 10 };
            float[] b = new float[4];

            for (int i = 0; i < N; i++)
                for (int j = 0; j < 4; j++)
                    b[j] = a[j] + a[j];

            ticks_end = DateTime.Now.Ticks;


            Console.WriteLine($"Done in {ticks_end - ticks_start} ticks");

            ticks_start = DateTime.Now.Ticks;

            Vector4 result;
            Vector4 v = new Vector4();
            for (int i = 0; i < N; i++)
            {
                v.W = a[0];
                v.X = a[1];
                v.Y = a[2];
                v.Z = a[3];
                result = Vector4.Add(v, v);
                b[0] = result.W;
                b[1] = result.X;
                b[2] = result.Y;
                b[3] = result.Z;
            }
            ticks_end = DateTime.Now.Ticks;
            Console.WriteLine($"Done in {ticks_end - ticks_start} ticks");

            Console.ReadKey();
        }
    }
}

The output is:

Done in 14257591 ticks
Done in 18591588 ticks

So it seems that Vector4 gives no advantage here. The Add method returns a new instance of Vector4. Is there a way to mutate one of the vectors in place to avoid the cost of that allocation? Or is there another way to approach this?

Solution

I haven't rigorously benchmarked this, but the body of your Vector4 loop:

for (int i = 0; i < N; i++)
{
    v.W = a[0];
    v.X = a[1];
    v.Y = a[2];
    v.Z = a[3];
    result = Vector4.Add(v, v);
    b[0] = result.W;
    b[1] = result.X;
    b[2] = result.Y;
    b[3] = result.Z;
}

is closer to the following array version:

float[] a = { 10, 10, 10, 10 };
float[] b = new float[4];
float[] v = new float[4];
float[] result = new float[4];

for (int i = 0; i < N; i++)
{
    v[0] = a[0];
    v[1] = a[1];
    v[2] = a[2];
    v[3] = a[3];
    result[0] = v[0] + v[0];
    result[1] = v[1] + v[1];
    result[2] = v[2] + v[2];
    result[3] = v[3] + v[3];
    b[0] = result[0];
    b[1] = result[1];
    b[2] = result[2];
    b[3] = result[3];
}

Your Vector4 loop makes four assignments into v, performs the addition, and then makes four more assignments from result into b. Your array loop skips all eight of those copies, so the two timings in the question do not compare like with like.

I just tested the array version above in LINQPad using Stopwatch (this is by no means a rigorous benchmark). With the extra copies included, the array code is slower than the Vector4 code, even with the additions unrolled, though only by a very narrow margin.
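
As for the allocation concern: Vector4 is a struct, so Vector4.Add returns a value and does not allocate on the heap. If you want to repeat the comparison yourself, here is a minimal sketch that times both variants with Stopwatch instead of DateTime.Now.Ticks. It also fills the vector through the four-float constructor and writes it back with CopyTo, rather than assigning each component by hand. The class name Vector4Timing and the helpers TimeMs, AddWithLoop and AddWithVector4 are names I made up for illustration. I am assuming a Release build run without a debugger attached, and the numbers will vary by machine and runtime, so treat this as a rough check rather than a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

static class Vector4Timing
{
    // Hypothetical helper: runs an action once and returns the elapsed milliseconds.
    static long TimeMs(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    // Scalar version: same nested loop as in the question.
    public static float[] AddWithLoop(float[] a, int n)
    {
        var b = new float[4];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < 4; j++)
                b[j] = a[j] + a[j];
        return b;
    }

    // Vector4 version: build the vector in one constructor call,
    // add it to itself, and copy all four components back at once.
    public static float[] AddWithVector4(float[] a, int n)
    {
        var b = new float[4];
        for (int i = 0; i < n; i++)
        {
            var v = new Vector4(a[0], a[1], a[2], a[3]);
            Vector4 result = v + v;   // Vector4 is a struct: no heap allocation here
            result.CopyTo(b);         // writes X, Y, Z, W into b[0..3]
        }
        return b;
    }

    static void Main()
    {
        const int N = 100000000;
        float[] a = { 10, 10, 10, 10 };

        // Warm-up calls so JIT compilation is not included in the timings.
        AddWithLoop(a, 1000);
        AddWithVector4(a, 1000);

        // Reports whether the JIT is using SIMD instructions for System.Numerics.
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Loop:    {TimeMs(() => AddWithLoop(a, N))} ms");
        Console.WriteLine($"Vector4: {TimeMs(() => AddWithVector4(a, N))} ms");
    }
}
```

Keep in mind that a four-float vector that is loaded, added once and stored again leaves very little room for SIMD to help. Most of the time in both loops goes to moving data in and out of the arrays.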