Tags: c#, performance, x86, simd, intrinsics

SIMD vectorization strategies for group-by operations on multiple, very large data arrays


I have to do a large number of aggregation operations, with the output grouped by some dimension (int/byte ID). I'm using C#, but hopefully I can still get good advice from the majority C++ crowd reading this :)

A simplified version is below:

        public static (double[], double[]) AggregateDataGroupBy(double[] data, double[] weight, byte[] dimension)
        {
            int numberOfValues = byte.MaxValue - byte.MinValue + 1;
            double[] totalValue = new double[numberOfValues];
            double[] totalWeight = new double[numberOfValues];

            for (int i = 0; i < data.Length; i++)
            {
                byte index = dimension[i];
                totalValue[index] += data[i];
                totalWeight[index] += weight[i];
            }

            return (totalValue, totalWeight);
        }

SIMD vectorization gives a significant speed-up when there is no need to group over the dimension. My first attempt at vectorising the operation was to gather the running totals for the dimensions of the 4 rows being processed, load the input vectors with ordinary loads, do the additions, and then scatter the totals back. As scatter isn't part of AVX2, this last part is particularly slow.

        public static unsafe (double[], double[]) AggregateDataGather(double[] data, double[] weight, int[] dimension)
        {
            int numberOfValues = 256;
            double[] totalValue = new double[numberOfValues];
            double[] totalWeight = new double[numberOfValues];

            if (Avx2.IsSupported)
            {
                int vectorSize = 256 / 8 / sizeof(double);
                int i;
                fixed (double* ptr = data, ptr2 = weight, ptrValue = totalValue, ptrWeight = totalWeight)
                {
                    fixed (int* dimptr = dimension)
                    {
                        var accValue = stackalloc double[vectorSize];
                        var accWeight = stackalloc double[vectorSize];
                        for (i = 0; i <= data.Length - vectorSize; i += vectorSize)
                        {
                            Vector128<int> indices = Avx2.LoadVector128(dimptr + i);
                            var accVectorV = Avx2.GatherVector256(ptrValue, indices, 8);
                            var accVectorW = Avx2.GatherVector256(ptrWeight, indices, 8);
                            var v = Avx2.LoadVector256(ptr + i);
                            var w = Avx2.LoadVector256(ptr2 + i);
                            accVectorV = Avx2.Add(accVectorV, v);
                            accVectorW = Avx2.Add(accVectorW, w);

                            Avx2.Store(accValue, accVectorV);
                            Avx2.Store(accWeight, accVectorW);
                            // NB: if the same index appears in more than one lane of this vector,
                            // the later lane's write overwrites the earlier lane's update.
                            for (int ii = 0; ii < vectorSize; ii++)
                            {
                                var index = dimptr[i + ii];
                                totalValue[index] = accValue[ii];
                                totalWeight[index] = accWeight[ii];
                            }
                        }
                    }
                }
            }

            else if (Avx.IsSupported || Sse42.IsSupported)
            {
                // Do other stuff
            }

            return (totalValue, totalWeight);
        }

(Please excuse the change of dimension from byte to int - I tested both, and both were slower.)

The intrinsics version above runs slower than the naive algorithm on my Ryzen 3600 (268 ms for 100M values, versus 230 ms).
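For reference, timings like these can be reproduced with a simple Stopwatch harness. This sketch is my own (not the original benchmark); it assumes the two methods defined above are in scope, and the row count and seed are illustrative:

```csharp
using System;
using System.Diagnostics;

// Illustrative benchmark sketch: fill 100M rows with random data,
// warm up the JIT, then time each implementation once.
var rng = new Random(42);
const int n = 100_000_000;
var data = new double[n];
var weight = new double[n];
var dimension = new int[n];          // int[] for the gather version
var dimensionBytes = new byte[n];    // byte[] for the naive version
for (int i = 0; i < n; i++)
{
    data[i] = rng.NextDouble();
    weight[i] = rng.NextDouble();
    dimension[i] = rng.Next(256);
    dimensionBytes[i] = (byte)dimension[i];
}

AggregateDataGroupBy(data, weight, dimensionBytes);   // warm up the JIT
var sw = Stopwatch.StartNew();
AggregateDataGroupBy(data, weight, dimensionBytes);
Console.WriteLine($"naive:  {sw.ElapsedMilliseconds} ms");

AggregateDataGather(data, weight, dimension);         // warm up the JIT
sw.Restart();
AggregateDataGather(data, weight, dimension);
Console.WriteLine($"gather: {sw.ElapsedMilliseconds} ms");
```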

Given that my data only changes after many aggregations (over hundreds or thousands of different dimensions), I find that my fastest implementation is to store the data (value, weight) in a vector and do a naive group-by. This gives similar performance on the Ryzen, but is 10% faster on an older i7 (without AVX).

        public static Vector2[] AggregateData(Vector2[] data, byte[] dimension)
        {
            int numberOfValues = byte.MaxValue - byte.MinValue + 1;
            Vector2[] sum = new Vector2[numberOfValues];

            for (int i = 0; i < data.Length; i++)
            {
                sum[dimension[i]] += data[i];
            }

            return sum;
        }

I'd read some papers on histogram functions that simply count the number of occurrences of each dimension value. They achieved close to a perfect 8x speed-up compared to naive approaches.
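For what it's worth, a common trick in those histogram papers is to keep several independent copies of the accumulator arrays and rotate between them, so that repeated bucket IDs in consecutive rows no longer serialize on the same memory location. A scalar sketch of that idea applied here (my own code, untested against the benchmarks above; the unroll factor of 4 is an arbitrary choice):

```csharp
public static (double[], double[]) AggregateDataReplicated(double[] data, double[] weight, byte[] dimension)
{
    const int copies = 4;    // independent accumulator copies, breaks the store-to-load chain
    const int buckets = 256;
    double[] values = new double[copies * buckets];
    double[] weights = new double[copies * buckets];

    int i = 0;
    int limit = data.Length - (copies - 1);
    for (; i < limit; i += copies)
    {
        // Each of the 4 updates targets a different copy, so two rows with
        // the same bucket ID can be in flight at the same time.
        values[0 * buckets + dimension[i + 0]] += data[i + 0];
        weights[0 * buckets + dimension[i + 0]] += weight[i + 0];
        values[1 * buckets + dimension[i + 1]] += data[i + 1];
        weights[1 * buckets + dimension[i + 1]] += weight[i + 1];
        values[2 * buckets + dimension[i + 2]] += data[i + 2];
        weights[2 * buckets + dimension[i + 2]] += weight[i + 2];
        values[3 * buckets + dimension[i + 3]] += data[i + 3];
        weights[3 * buckets + dimension[i + 3]] += weight[i + 3];
    }
    for (; i < data.Length; i++)   // leftover rows go to copy 0
    {
        values[dimension[i]] += data[i];
        weights[dimension[i]] += weight[i];
    }

    // Fold the copies back into one result per bucket.
    double[] totalValue = new double[buckets];
    double[] totalWeight = new double[buckets];
    for (int c = 0; c < copies; c++)
        for (int b = 0; b < buckets; b++)
        {
            totalValue[b] += values[c * buckets + b];
            totalWeight[b] += weights[c * buckets + b];
        }
    return (totalValue, totalWeight);
}
```

The cost is 4x the accumulator memory (still only 16 KB here) plus a cheap fold at the end.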

Have I missed something in my attempt to use AVX2 intrinsics? Am I always going to be faced with inefficient gather/scatter operations? Any comments/suggestions?

As a sub-case, are there strategies that will only work when the dimension size is small (processing 4 dimension values at a time)? E.g. loading each value into a vector with a single non-zero lane, as follows, and tuning the number of rows processed at a time to make full use of the cache.

Values (11, 12, 13, 14, 15, 16, 17)
Indices (1, 0, 3, 1, 2, 0, 3)
=> 
  <0, 11, 0, 0>
+ <12, 0, 0, 0>
+ <0, 0, 0, 13>
+ <0, 14, 0, 0>
+ <0, 0, 15, 0>
+ <16, 0, 0, 0>
+ <0, 0, 0, 17>

(This didn't appear to me to be a likely solution, due to its inefficiency once the dimension size increases, so I haven't tried it yet - but I will if it's suggested as an efficient work-around.)
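For anyone who wants to experiment with the one-hot idea anyway, it can be sketched as follows for a dimension size of 4: broadcast the bucket ID, compare it against a constant <0,1,2,3> vector to get a one-hot mask, and use the mask to add the value into the matching lane of an in-register accumulator. This is an untested illustration of the idea above, not a benchmarked implementation:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Illustrative sketch (untested): one-hot accumulation for a dimension
// size of 4, keeping all four running totals in a single register.
public static unsafe double[] AggregateOneHot4(double[] values, byte[] indices)
{
    Vector256<double> lanes = Vector256.Create(0.0, 1.0, 2.0, 3.0);
    Vector256<double> acc = Vector256<double>.Zero;

    for (int i = 0; i < values.Length; i++)
    {
        // Broadcast the bucket ID and compare against <0,1,2,3>:
        // the matching lane becomes all-ones, the rest all-zeros.
        Vector256<double> mask = Avx.CompareEqual(
            Vector256.Create((double)indices[i]), lanes);
        // Masked broadcast of the value -> a one-hot vector, then add.
        acc = Avx.Add(acc, Avx.And(mask, Vector256.Create(values[i])));
    }

    var result = new double[4];
    fixed (double* p = result) Avx.Store(p, acc);
    return result;
}
```

This keeps the accumulators in registers entirely, but the per-row cost grows linearly with the number of buckets, which is presumably why it stops paying off as the dimension size increases.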


Solution

  • As said in the comments, vectorization is hard for such an aggregation use case. However, that does not mean SIMD is completely useless for your problem. Try this version (untested).

    The main idea: this version saves 50% of the random loads/stores spent updating the accumulators. It interleaves the two accumulator arrays in memory, uses 128-bit load/add/store instructions, and splits the result back into 2 C# arrays after all input values have been consumed.

    static unsafe void aggregateSse2( double* accumulators, double* a, double* b, byte* dimension, int count )
    {
        Debug.Assert( count >= 0 );
        double* aEnd = a + ( count & ( ~1 ) );
        while( a < aEnd )
        {
            // Load accumulator corresponding to the first bucket
            double* accPointer = accumulators + ( 2u * dimension[ 0 ] );
            Vector128<double> acc = Sse2.LoadAlignedVector128( accPointer );
    
            // Load 2 values from each input array.
            // BTW, it's possible to use AVX and unroll by 4 instead of 2, using GetLower/GetUpper to extract the 16-byte pieces.
            // That would save a few loads at the cost of more shuffles, and might be slightly faster overall.
            Vector128<double> va = Sse2.LoadVector128( a );
            Vector128<double> vb = Sse2.LoadVector128( b );
    
            // Increment accumulator with the first value of each array, store back to RAM
            acc = Sse2.Add( acc, Sse2.UnpackLow( va, vb ) );
            Sse2.StoreAligned( accPointer, acc );
    
            // Load accumulator corresponding to the second bucket.
            // Potentially it might be the same pointer, can't load both in advance.
            accPointer = accumulators + ( 2u * dimension[ 1 ] );
            acc = Sse2.LoadAlignedVector128( accPointer );
            a += 2;
            b += 2;
            dimension += 2;
    
            // Increment accumulator with the second value of each array, store back to RAM
            acc = Sse2.Add( acc, Sse2.UnpackHigh( va, vb ) );
            Sse2.StoreAligned( accPointer, acc );
        }
    
        if( 0 != ( count & 1 ) )
        {
            // The input size was odd number, one item left at these pointers.
    
            // Load a scalar from first input array into lower lane of a vector
            Vector128<double> vec = Sse2.LoadScalarVector128( a );
    
            // Load the accumulator corresponding to the bucket
            double* accPointer = accumulators + ( 2u * dimension[ 0 ] );
            Vector128<double> acc = Sse2.LoadAlignedVector128( accPointer );
    
            // Load scalar from second input array into higher lane of that vector
            vec = Sse2.LoadHigh( vec, b );
    
            // Increment accumulator and store back to RAM
            acc = Sse2.Add( acc, vec );
            Sse2.StoreAligned( accPointer, acc );
        }
    }
    
    static unsafe void splitAccumulators( double* values, double* weights, double* accumulators, int numberOfValues )
    {
        double* end = accumulators + numberOfValues * 2;
        while( accumulators < end )
        {
            Vector128<double> vec = Sse2.LoadAlignedVector128( accumulators );
            accumulators += 2;
            Sse2.StoreScalar( values, vec );
            values++;
            Sse2.StoreHigh( weights, vec );
            weights++;
        }
    }
    
    /// <summary>Align pointer by 16 bytes, rounding up.</summary>
    [MethodImpl( MethodImplOptions.AggressiveInlining )]
    static unsafe void* roundUpBy16( void* pointer )
    {
        if( Environment.Is64BitProcess )  // This branch is missing from JIT output BTW, it's free.
        {
            long a = (long)pointer;
            a = ( a + 15L ) & ( -16L );
            return (void*)a;
        }
        else
        {
            int a = (int)pointer;
            a = ( a + 15 ) & ( -16 );
            return (void*)a;
        }
    }
    
    [SkipLocalsInit] // Otherwise the runtime gonna zero-initialize the stack allocated buffer, very slowly with `push 0` instructions in a loop.
    public static (double[], double[]) AggregateDataSse2( double[] data, double[] weight, byte[] dimension )
    {
        Debug.Assert( data.Length == weight.Length && data.Length == dimension.Length );
    
        const int numberOfValues = 0x100;
        unsafe
        {
            // The buffer is about 4kb RAM, fits in L1D cache.
            // Allocating 2 extra doubles (16 extra bytes) to align the pointer.
            double* accumulators = stackalloc double[ ( numberOfValues * 2 ) + 2 ];
            // Align by 16 bytes
            accumulators = (double*)roundUpBy16( accumulators );
            // Clear accumulators with zeros, let's hope the implementation of that standard library method is good.
            new Span<double>( accumulators, numberOfValues * 2 ).Fill( 0 );
    
            // Process the input data
            fixed( double* a = data )
            fixed( double* b = weight )
            fixed( byte* dim = dimension )
                aggregateSse2( accumulators, a, b, dim, data.Length );
    
            // Split the result into 2 arrays
            double[] totalValue = new double[ numberOfValues ];
            double[] totalWeight = new double[ numberOfValues ];
            fixed( double* values = totalValue )
            fixed( double* weights = totalWeight )
                splitAccumulators( values, weights, accumulators, numberOfValues );
    
            return (totalValue, totalWeight);
        }
    }
    

    It only uses SSE2 because it doesn't need wider vectors, but it should still save a non-trivial number of instructions and RAM transactions compared to your scalar version. I would expect a measurable improvement on most computers.
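    A quick way to sanity-check the interleaved version is to compare it against the scalar reference from the question. This harness is illustrative (my own, not part of the answer); it uses an odd row count to exercise the tail path, and a relative tolerance because the two versions sum in a different order:

```csharp
var rng = new Random(1);
int n = 1_000_001;                       // odd count exercises the remainder branch
var data = new double[n];
var weight = new double[n];
var dimension = new byte[n];
for (int i = 0; i < n; i++)
{
    data[i] = rng.NextDouble();
    weight[i] = rng.NextDouble();
    dimension[i] = (byte)rng.Next(256);
}

var (v1, w1) = AggregateDataGroupBy(data, weight, dimension);  // scalar reference
var (v2, w2) = AggregateDataSse2(data, weight, dimension);
for (int b = 0; b < 256; b++)
{
    // Floating-point sums differ slightly between the two summation orders.
    Debug.Assert(Math.Abs(v1[b] - v2[b]) < 1e-9 * Math.Max(1.0, Math.Abs(v1[b])));
    Debug.Assert(Math.Abs(w1[b] - w2[b]) < 1e-9 * Math.Max(1.0, Math.Abs(w1[b])));
}
```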