
C# .NET SIMD: System.Numerics.Vector4 slower than a plain loop


I wrote the following code to experiment with System.Numerics.Vector4 and evaluate the performance gain:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Numerics;

namespace ConsoleApp8
{
    class Program
    {
        static void Main(string[] args)
        {
            const int N = 100000000;
            long ticks_start, ticks_end;

            ticks_start = DateTime.Now.Ticks;

            float[] a = { 10, 10, 10, 10 };
            float[] b = new float[4];

            for (int i = 0; i < N; i++)
                for (int j = 0; j < 4; j++)
                    b[j] = a[j] + a[j];

            ticks_end = DateTime.Now.Ticks;


            Console.WriteLine($"Done in {ticks_end - ticks_start} ticks");

            ticks_start = DateTime.Now.Ticks;

            Vector4 result;
            Vector4 v = new Vector4();
            for (int i = 0; i < N; i++)
            {
                v.W = a[0];
                v.X = a[1];
                v.Y = a[2];
                v.Z = a[3];
                result = Vector4.Add(v, v);
                b[0] = result.W;
                b[1] = result.X;
                b[2] = result.Y;
                b[3] = result.Z;
            }
            ticks_end = DateTime.Now.Ticks;
            Console.WriteLine($"Done in {ticks_end - ticks_start} ticks");

            Console.ReadKey();
        }
    }
}

The output is:

Done in 14257591 ticks
Done in 18591588 ticks

So it seems that we get no advantage using Vector4. The Add method returns a new instance of Vector4. Is there a way to mutate one of the vectors to avoid the memory allocation impact? Or maybe there is another way to do things?


Solution

  • I haven't properly benchmarked it, but your Vector4 loop:

    for (int i = 0; i < N; i++)
    {
        v.W = a[0];
        v.X = a[1];
        v.Y = a[2];
        v.Z = a[3];
        result = Vector4.Add(v, v);
        b[0] = result.W;
        b[1] = result.X;
        b[2] = result.Y;
        b[3] = result.Z;
    }

    is closer to the following array code:

    float[] a = { 10, 10, 10, 10 };
    float[] b = new float[4];
    float[] v = new float[4];
    float[] result = new float[4];
    
    for (int i = 0; i < N; i++)
    {
        v[0] = a[0];
        v[1] = a[1];
        v[2] = a[2];
        v[3] = a[3];
        result[0] = v[0] + v[0];
        result[1] = v[1] + v[1];
        result[2] = v[2] + v[2];
        result[3] = v[3] + v[3];
        b[0] = result[0];
        b[1] = result[1];
        b[2] = result[2];
        b[3] = result[3];
    }
    

    than to the scalar loop you wrote. With Vector4 you make four assignments (into v), then the addition, then four more assignments (copying result into b). Your array loop skips all eight of those copies.

    I just tested it in LINQPad using a Stopwatch (this is by no means a rigorous benchmark). With the extra copies included, the array version is slower than the Vector4 version, even with the sum unrolled, though only by a very tight margin.
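If you want to repeat the comparison yourself, here is a minimal sketch rather than a proper benchmark. It times both variants with Stopwatch instead of DateTime.Now.Ticks, loads the vector through the Vector4 constructor, and stores the result with Vector4.CopyTo instead of eight separate field assignments. The class and method names (Vector4TimingSketch, AddScalar, AddVector4, TimeMs) are made up for this example. Note that Vector4 is a struct, so Vector4.Add returns a value and does not allocate on the heap.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;

public static class Vector4TimingSketch
{
    // Scalar version: four element-wise additions written straight into b.
    public static void AddScalar(float[] a, float[] b, int iterations)
    {
        for (int i = 0; i < iterations; i++)
            for (int j = 0; j < 4; j++)
                b[j] = a[j] + a[j];
    }

    // Vector4 version: build the vector with the (x, y, z, w) constructor,
    // add it to itself, then store all four components with CopyTo.
    public static void AddVector4(float[] a, float[] b, int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            var v = new Vector4(a[0], a[1], a[2], a[3]);
            Vector4 result = Vector4.Add(v, v); // struct result, no heap allocation
            result.CopyTo(b);                   // writes X, Y, Z, W into b[0..3]
        }
    }

    // Runs the action once and returns the elapsed time in milliseconds.
    public static long TimeMs(Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const int N = 100000000;
        float[] a = { 10, 10, 10, 10 };
        float[] b = new float[4];

        // Reports whether the JIT is using SIMD hardware for System.Numerics.
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");

        // Short warm-up so JIT compilation is not part of the measurement.
        AddScalar(a, b, 1000);
        AddVector4(a, b, 1000);

        Console.WriteLine($"Scalar loop: {TimeMs(() => AddScalar(a, b, N))} ms");
        Console.WriteLine($"Vector4:     {TimeMs(() => AddVector4(a, b, N))} ms");
    }
}
```

The numbers will depend on your machine and runtime, and they are probably only meaningful in a Release build run without the debugger attached. For anything more serious than a quick check, a dedicated benchmarking tool would be a better choice than hand-rolled Stopwatch timing.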