Tags: c#, performance, vectorization, simd, memory-bandwidth

Why is the performance gain of C# SIMD lower with larger arrays than with tiny arrays?


I've been working on a deep learning library that I'm writing on my own. For matrix operations, getting the best performance is key for me. I've been researching programming languages and their performance on numeric operations. After a while, I found that C# SIMD has very similar performance to C++ SIMD, so I decided to write the library in C#.

Firstly, I tested C# SIMD (I tested a lot of things, but I'm not going to write them all here). I noticed that it worked a lot better with smaller arrays; the efficiency is not good with bigger arrays. I think that's ridiculous: normally, things get more efficient as they get bigger.

My question is: "Why does vectorization perform worse with bigger arrays in C#?"

I am going to share some benchmarks (done by myself) using BenchmarkDotNet.

Program.Size = 10

| Method |      Mean |     Error |    StdDev |
|------- |----------:|----------:|----------:|
|     P1 |  28.02 ns | 0.5225 ns | 0.4888 ns |
|     P2 | 154.15 ns | 1.1220 ns | 0.9946 ns |
|     P3 | 100.88 ns | 0.8863 ns | 0.8291 ns |

Program.Size = 10000

| Method |     Mean |    Error |   StdDev |   Median |
|------- |---------:|---------:|---------:|---------:|
|     P1 | 142.0 ms | 3.065 ms | 8.989 ms | 139.5 ms |
|     P2 | 170.3 ms | 3.365 ms | 5.981 ms | 170.1 ms |
|     P3 | 103.3 ms | 2.400 ms | 2.245 ms | 102.8 ms |

So as you see, I increased Size by a factor of 1000, which means the number of array elements (Size * Size) grew by a factor of 1,000,000. P2 took 154 ns at first; in the second test it took 170 ms, roughly the 1,000,000x increase we would expect. P3 likewise took almost exactly 1,000,000 times longer (about 100 ns to about 100 ms). However, what I want to point out here is that P1, the vectorized loop, has significantly lower relative performance than before. I wonder why.

Note that P3 is independent of this topic. P1 is the vectorized version of P2, so we can say the efficiency of vectorization is the ratio P2/P1 in terms of the time they took. My code is below:

Matrix class:

using System;
using System.Buffers;
using System.Runtime.CompilerServices;

public sealed class Matrix1
{
    public float[] Array;
    public int D1, D2;
    const int size = 110000000;
    private static ArrayPool<float> sizeAwarePool = ArrayPool<float>.Create(size, 100);

    public Matrix1(int d1, int d2)
    {
        D1 = d1;
        D2 = d2;
        if(D1*D2 > size)
        { throw new Exception("Size!"); }
        Array = sizeAwarePool.Rent(D1 * D2);
    }

    bool Deleted = false;
    public void Dispose()
    {
        sizeAwarePool.Return(Array);
        Deleted = true;
    }

    ~Matrix1()
    {
        if(!Deleted)
        {
            throw new Exception("Error!");
        }
    }

    public float this[int x, int y]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            return Array[x * D2 + y];
        }
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        set
        {
            Array[x * D2 + y] = value;
        }
    }
}

Program Class:

using System.Numerics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    const int Size = 10000;

    [Benchmark]
    public void P1()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        int sz = Vector<float>.Count;
        for (int i = 0; i < Size * Size; i += sz)
        {
            var v1 = new Vector<float>(a.Array, i);
            var v2 = new Vector<float>(b.Array, i);
            var v3 = v1 + v2;
            v3.CopyTo(c.Array, i);
        }

    }

    [Benchmark]
    public void P2()
    {
        Matrix1 a = Program.a, b = Program.b, c = Program.c;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                c[i, j] = a[i, j] + b[i, j];
    }
    [Benchmark]
    public void P3()
    {
        Matrix1 a = Program.a;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i + j - j;
                //could have written a.Array[i*Size + j] = i + j,
                //but it would have made no difference in terms of performance,
                //so I left it this way.
    }


    public static Matrix1 a = new Matrix1(Size, Size);
    public static Matrix1 b = new Matrix1(Size, Size);
    public static Matrix1 c = new Matrix1(Size, Size);

    static void Main(string[] args)
    {
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                a[i, j] = i;
        for (int i = 0; i < Size; i++)
            for (int j = 0; j < Size; j++)
                b[i, j] = j;
        for (int i = 0; i < Size; i++)  
            for (int j = 0; j < Size; j++)
                c[i, j] = 0;

        var summary = BenchmarkRunner.Run<Program>();
        a.Dispose();
        b.Dispose();
        c.Dispose();
    }
}     

I assure you that x[i, j] doesn't affect the performance; it's the same as using x.Array[i*Size + j] (the indexer is aggressively inlined).


Solution

  • This might not be the whole story: the OP reports in comments that they sped up P1 from 140 to 120 ms with jagged arrays.

    So maybe something extra is holding it back in the large case. I'd use performance counters to investigate and check for ld_blocks_partial.address_alias (4k aliasing -> false dependency of loads on stores). And/or look at the memory addresses you get from the C# allocator and see whether they're close to, but not quite, all the same alignment relative to a 4k boundary (there's a sketch of that check below).
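
    As a minimal sketch of that last check (not part of the original benchmark; it just pins each backing array long enough to read its address), you could print each array's offset within its 4 KiB page:

    using System;
    using System.Runtime.InteropServices;

    static class AlignmentCheck
    {
        // Prints where the first element of the array sits inside a 4 KiB page.
        // If all three backing arrays report (nearly) the same offset, their
        // load/store streams map to the same L1d sets and can trigger the same
        // 4k-aliasing false dependencies at the same indices.
        public static void Print(string name, float[] array)
        {
            GCHandle handle = GCHandle.Alloc(array, GCHandleType.Pinned);
            try
            {
                long address = handle.AddrOfPinnedObject().ToInt64();
                Console.WriteLine($"{name}: address % 4096 = {address % 4096}");
            }
            finally
            {
                handle.Free();
            }
        }
    }

    // Usage, e.g.: AlignmentCheck.Print("a", a.Array); and likewise for b and c.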

    I don't think needing 3 hot cache lines in the same set would be a problem; L1d is 8-way associative on any CPU that would give >4x speedups with AVX (i.e. with 256-bit load/store and ALUs). But if all your arrays have the same alignment relative to a 4k boundary, they will all alias the same set in a 32kiB L1d cache when you access the same index.

    Oh, here's a theory: jagged arrays stagger the page walks, instead of all 3 streams (2 src 1 dst) reaching a new page at the same time and all having a TLB miss that requires a walk. Try making sure your code uses 2M hugepages instead of just 4k pages to reduce TLB misses. (e.g. on Linux you'd use a madvise(buf, size, MADV_HUGEPAGE) system call; a rough P/Invoke sketch is below.)
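
    A rough sketch of how that could look from .NET (a hypothetical, Linux-only helper, not from the original answer; the constants are the usual x86-64 values, and the mapped buffer would have to be used through a Span<float> or unsafe code rather than a managed float[]):

    using System;
    using System.Runtime.InteropServices;

    static class HugePageBuffer
    {
        const int PROT_READ = 0x1, PROT_WRITE = 0x2;          // mmap protection flags
        const int MAP_PRIVATE = 0x02, MAP_ANONYMOUS = 0x20;   // anonymous private mapping
        const int MADV_HUGEPAGE = 14;                         // request transparent hugepages

        [DllImport("libc", SetLastError = true)]
        static extern IntPtr mmap(IntPtr addr, UIntPtr length, int prot, int flags, int fd, IntPtr offset);

        [DllImport("libc", SetLastError = true)]
        static extern int madvise(IntPtr addr, UIntPtr length, int advice);

        // Allocates a page-aligned anonymous mapping and hints the kernel to back it
        // with 2M pages. The hint is best-effort; the kernel may still use 4k pages.
        public static IntPtr Allocate(ulong bytes)
        {
            IntPtr p = mmap(IntPtr.Zero, (UIntPtr)bytes, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, IntPtr.Zero);
            if (p == new IntPtr(-1))
                throw new InvalidOperationException("mmap failed");
            madvise(p, (UIntPtr)bytes, MADV_HUGEPAGE);
            return p;
        }
    }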

    Check performance counter events for dtlb_load_misses.miss_causes_a_walk and/or dtlb_load_misses.stlb_hit. There is TLB prefetch so having them staggered can allow TLB prefetch to work on one or two in parallel instead of getting hit with all 3 page walks at once.


    Large sizes bottleneck on memory bandwidth, not just ALU

    SIMD doesn't increase the available memory bandwidth, just how quickly you can get data in/out of cache; it increases how much of the available bandwidth you can actually use most of the time. Doing the same work in fewer instructions can also help OoO exec see farther ahead and detect TLB misses sooner, though.

    The speedup with large arrays is limited because the scalar version is already close-ish to bottlenecked on main-memory bandwidth. Your C[i] = A[i] + B[i] access pattern is the STREAM sum access pattern: maximal memory access per ALU operation. (1D vs. 2D indexing is irrelevant; you're still just reading/writing contiguous memory and doing pure vertical SIMD float addition, explicitly so in the P1 case.)
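
    Rough arithmetic with the numbers from the question (my estimate, not a measurement from the answer): at Size = 10000, each call touches 10000 * 10000 floats per array, i.e. roughly 400 MB read from A, 400 MB read from B, and 400 MB written to C, about 1.2 GB per call. At ~142 ms per call that's on the order of 8.5 GB/s of useful traffic, and counting the read-for-ownership traffic the store stream generates, the DRAM moves closer to 11 GB/s, which is in the ballpark of what a single core can sustain, so both the scalar and the SIMD versions end up waiting on memory.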

    With small matrices (10x10 = 100 floats = 400 bytes * (2 sources + 1 dst) = 1.2 kB), your data can stay hot in L1d cache, so cache misses won't bottleneck your SIMD loop.

    With your src + dst hot in L1d cache, you can get close to the full 8x SIMD speedup over scalar, with AVX processing 8x 32-bit elements per vector, assuming a Haswell or later CPU with peak load+store throughput of 2x 32-byte vector loads + 1x 32-byte vector store per clock cycle.

    In practice you got 154.15 / 28.02 = ~5.5 for the small-matrix case.

    Actual cache limitations apparently preclude that, e.g. Intel's optimization manual lists ~81 bytes / clock cycle typical sustained load + store bandwidth for Skylake's L1d cache. But with GP-integer loads + stores, Skylake can sustain 2 loads + 1 store per cycle for 32-bit operand-size, with the right loop. So there's some kind of microarchitectural limit other than load/store uop throughput that slows down vector load/store somewhat.


    You didn't say what hardware you have, but I'm guessing it's Intel Haswell or later. "Only" 5.5x speedup might be due to benchmark overhead for only doing 12 or 13 loop iterations per call.

    (100 elements / 8 elem/vec = 12.5. So 12 if you leave the last 4 elements not done, or 13 if you overread by 4 because your loop condition isn't i < Size * Size - sz + 1)
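
    One way to handle that tail cleanly (a sketch, not code from the question): run the vector loop only while a full Vector<float> still fits, then finish the remaining elements with scalar code so nothing reads past the end of the arrays:

    using System.Numerics;

    static class VectorAdd
    {
        // c[i] = a[i] + b[i] for i in [0, n), with no overread past n.
        public static void Add(float[] a, float[] b, float[] c, int n)
        {
            int sz = Vector<float>.Count;
            int i = 0;
            // Vector loop: runs only while a full vector fits (i + sz <= n).
            for (; i <= n - sz; i += sz)
            {
                var v1 = new Vector<float>(a, i);
                var v2 = new Vector<float>(b, i);
                (v1 + v2).CopyTo(c, i);
            }
            // Scalar tail: the remaining n % sz elements.
            for (; i < n; i++)
                c[i] = a[i] + b[i];
        }
    }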

    Zen's 2x 16-byte memory ops per clock (up to one of which can be a store) would slow down both scalar and AVX equally. But you'd still get at best 4x speedup going from 1 element per vector with movss / addss xmm, mem / movss to the same uops doing 4 elements at once. Using 256-bit instructions on Zen 1 just means 2 uops per instruction, with the same 2 memory uops per clock throughput limit. Better front-end throughput from using 2-uop instructions, but that's not the bottleneck here. (Assuming the compiler can make a loop in 5 uops or less it can issue at 1 iter per clock, and couldn't even run that fast because of the back-end bottleneck on load/store ports.)

    Those results would also make sense on a Zen 2, I think: 256-bit SIMD execution units and I think also load/store ports mean that you can expect up to 8x speedups when doing 8x the amount of work per instruction.