c++, simd, avx, avx2

Accumulating Doubles Into Bins via Intrinsics


I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.

A naive implementation is below:

const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS];     // Observations
int8_t offsets[N_OBS]; // Bin index (0..N_BINS-1) for each observation
double acc[N_BINS] = {0};

for (int i = 0; i < N_OBS; ++i) {
  acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}

Is this possible using SIMD/AVX intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.


Solution

  • Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I’m getting 48 ms for 100M random numbers of input. Reading 100M doubles plus 100M one-byte indices is about 0.9 GB, which works out to roughly 18 GB/sec of RAM read bandwidth. That’s about 35% of the theoretical bandwidth limit of my computer (dual-channel DDR4-3200, ~51 GB/sec).

    No SIMD is going to help here, I’m afraid: the loop is limited by memory bandwidth, not by computation. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler (-fopenmp for GCC and Clang, /openmp for MSVC).

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <immintrin.h>

    void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
    {
        // Count of OpenMP threads = CPU cores to use
        constexpr int ompThreadsCount = 4;
    
        // Use an independent set of accumulators per thread, otherwise concurrent writes would corrupt the data.
        // Align by 64 bytes = one cache line; we want each cache line owned by a single CPU core, because sharing them (false sharing) is extremely expensive
        alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
        memset( &accumulators, 0, sizeof( accumulators ) );
    
        // Minimize OMP overhead by dispatching very few large tasks
    #pragma omp parallel for schedule(static, 1)
        for( int i = 0; i < ompThreadsCount; i++ )
        {
            // Grab a slice of the output buffer
            double* const acc = &accumulators[ i * 16 ];
    
            // Compute a slice of the source data for this thread
            const size_t first = i * length / ompThreadsCount;
            const size_t last = ( i + 1 ) * length / ompThreadsCount;
    
            // Accumulate into the thread-local portion of the buffer
            // (renamed the loop variable to avoid shadowing the outer `i`)
            for( size_t j = first; j < last; j++ )
            {
                const int8_t idx = indices[ j ];
                acc[ idx ] += rsi[ j ];
            }
        }
    
        // Reduce 16*N scalars to 16 with a few AVX instructions
        for( int i = 0; i < 16; i += 4 )
        {
            __m256d v = _mm256_load_pd( &accumulators[ i ] );
            for( int j = 1; j < ompThreadsCount; j++ )
            {
                __m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
                v = _mm256_add_pd( v, v2 );
            }
            _mm256_storeu_pd( rdi + i, v );
        }
    }
    

    The above version runs in 20.5 ms, which translates to 88% of the RAM bandwidth limit.
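
    For reference, a minimal test harness along these lines can reproduce the measurement. This is just a sketch of mine, not part of the solution itself: the random-fill details and the 42 seed are purely illustrative, and the 100M elements / 16 bins mirror the question’s setup.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main()
    {
        // Sketch of a benchmark harness; RNG choice and seed are illustrative only
        const size_t length = 100'000'000;
        std::vector<double> obs( length );
        std::vector<int8_t> offsets( length );

        // Fill with random observations and random bin indices in [0, 16)
        std::mt19937 rng{ 42 };
        std::uniform_real_distribution<double> val( 0.0, 1.0 );
        std::uniform_int_distribution<int> bin( 0, 15 );
        for( size_t i = 0; i < length; i++ )
        {
            obs[ i ] = val( rng );
            offsets[ i ] = (int8_t)bin( rng );
        }

        // Time a single call; in practice you'd average over several runs
        double acc[ 16 ];
        const auto start = std::chrono::steady_clock::now();
        computeHistogramScalarOmp( obs.data(), offsets.data(), length, acc );
        const std::chrono::duration<double, std::milli> elapsed = std::chrono::steady_clock::now() - start;
        printf( "%.1f ms\n", elapsed.count() );
    }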

    P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores / 16 threads in this CPU. Both lower and higher values decrease the effective bandwidth. The constant is probably CPU-specific.
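
    If you’d rather measure that constant than hard-code it, one possible approach (a sketch of mine, reusing the same algorithm as above) is to lift the thread count into a template parameter and time a few instantiations on the target machine:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <immintrin.h>

    // Sketch: same algorithm as computeHistogramScalarOmp above, with the
    // thread count as a template parameter so several variants can be timed.
    template<int Threads>
    void computeHistogramOmpT( const double* rsi, const int8_t* indices, size_t length, double* rdi )
    {
        alignas( 64 ) double accumulators[ 16 * Threads ];
        memset( accumulators, 0, sizeof( accumulators ) );

    #pragma omp parallel for schedule(static, 1)
        for( int i = 0; i < Threads; i++ )
        {
            double* const acc = &accumulators[ i * 16 ];
            const size_t first = i * length / Threads;
            const size_t last = ( i + 1 ) * length / Threads;
            for( size_t j = first; j < last; j++ )
                acc[ indices[ j ] ] += rsi[ j ];
        }

        for( int i = 0; i < 16; i += 4 )
        {
            __m256d v = _mm256_load_pd( &accumulators[ i ] );
            for( int j = 1; j < Threads; j++ )
                v = _mm256_add_pd( v, _mm256_load_pd( &accumulators[ i + j * 16 ] ) );
            _mm256_storeu_pd( rdi + i, v );
        }
    }

    // Candidate thread counts to benchmark; keep whichever runs fastest.
    using HistogramFn = void (*)( const double*, const int8_t*, size_t, double* );
    constexpr HistogramFn candidates[] = {
        &computeHistogramOmpT<2>, &computeHistogramOmpT<4>,
        &computeHistogramOmpT<8>, &computeHistogramOmpT<16>,
    };

    Timing each candidate once on the deployment hardware sidesteps guessing at the CPU-specific sweet spot.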