c++, simd, avx, avx2

Accumulating Doubles Into Bins via Intrinsics


I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.

A naive implementation is below:

const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS];     // Observations
int8_t offsets[N_OBS]; // Bin index (0..N_BINS-1) for each observation
double acc[N_BINS] = {0};

for (int i = 0; i < N_OBS; ++i) {
  acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}

Is this possible using SIMD/AVX intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but can't seem to figure out a good way to get it done.


Solution

  • Modern CPUs are surprisingly good at running your naïve version. On AMD Zen3, I’m getting 48 ms for 100M random numbers of input. Reading 100M doubles plus 100M one-byte indices is about 0.9 GB, which works out to roughly 18 GB/sec of RAM read bandwidth. That’s about 35% of the theoretical bandwidth limit of my computer (dual-channel DDR4-3200, ~51 GB/sec).

    No SIMD is going to help here, I’m afraid: the loop is limited by memory bandwidth, not by computation. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler (-fopenmp for GCC and Clang, /openmp for MSVC).

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <immintrin.h>

    void computeHistogramScalarOmp( const double* rsi, const int8_t* indices, size_t length, double* rdi )
    {
        // Count of OpenMP threads = CPU cores to use
        constexpr int ompThreadsCount = 4;
    
        // Use an independent set of accumulators per thread, otherwise concurrent writes would corrupt the data.
        // Align by 64 bytes = one cache line; we want each cache line owned by a single CPU core, because sharing them (false sharing) is extremely expensive
        alignas( 64 ) double accumulators[ 16 * ompThreadsCount ];
        memset( &accumulators, 0, sizeof( accumulators ) );
    
        // Minimize OMP overhead by dispatching very few large tasks
    #pragma omp parallel for schedule(static, 1)
        for( int i = 0; i < ompThreadsCount; i++ )
        {
            // Grab a slice of the output buffer
            double* const acc = &accumulators[ i * 16 ];
    
            // Compute a slice of the source data for this thread
            const size_t first = i * length / ompThreadsCount;
            const size_t last = ( i + 1 ) * length / ompThreadsCount;
    
            // Accumulate into the thread-local portion of the buffer
            // (renamed the loop variable to avoid shadowing the outer `i`)
            for( size_t j = first; j < last; j++ )
            {
                const int8_t idx = indices[ j ];
                acc[ idx ] += rsi[ j ];
            }
        }
    
        // Reduce 16*N scalars to 16 with a few AVX instructions
        for( int i = 0; i < 16; i += 4 )
        {
            __m256d v = _mm256_load_pd( &accumulators[ i ] );
            for( int j = 1; j < ompThreadsCount; j++ )
            {
                __m256d v2 = _mm256_load_pd( &accumulators[ i + j * 16 ] );
                v = _mm256_add_pd( v, v2 );
            }
            _mm256_storeu_pd( rdi + i, v );
        }
    }
    

    The above version runs in 20.5 ms, which translates to 88% of the RAM bandwidth limit.
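
    For reference, a minimal test harness along these lines can reproduce the measurement. This is just a sketch of mine, not part of the solution itself: the random-fill details and the 42 seed are purely illustrative, and the 100M elements / 16 bins mirror the question’s setup.

    #include <chrono>
    #include <cstddef>
    #include <cstdio>
    #include <random>
    #include <vector>

    int main()
    {
        // Sketch of a benchmark harness; RNG choice and seed are illustrative only
        const size_t length = 100'000'000;
        std::vector<double> obs( length );
        std::vector<int8_t> offsets( length );

        // Fill with random observations and random bin indices in [0, 16)
        std::mt19937 rng{ 42 };
        std::uniform_real_distribution<double> val( 0.0, 1.0 );
        std::uniform_int_distribution<int> bin( 0, 15 );
        for( size_t i = 0; i < length; i++ )
        {
            obs[ i ] = val( rng );
            offsets[ i ] = (int8_t)bin( rng );
        }

        // Time a single call; in practice you'd average over several runs
        double acc[ 16 ];
        const auto start = std::chrono::steady_clock::now();
        computeHistogramScalarOmp( obs.data(), offsets.data(), length, acc );
        const std::chrono::duration<double, std::milli> elapsed = std::chrono::steady_clock::now() - start;
        printf( "%.1f ms\n", elapsed.count() );
    }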

    P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores / 16 threads in this CPU. Both lower and higher values decrease the effective bandwidth. The constant is probably CPU-specific.
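
    If you’d rather measure that constant than hard-code it, one possible approach (a sketch of mine, reusing the same algorithm as above) is to lift the thread count into a template parameter and time a few instantiations on the target machine:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <immintrin.h>

    // Sketch: same algorithm as computeHistogramScalarOmp above, with the
    // thread count as a template parameter so several variants can be timed.
    template<int Threads>
    void computeHistogramOmpT( const double* rsi, const int8_t* indices, size_t length, double* rdi )
    {
        alignas( 64 ) double accumulators[ 16 * Threads ];
        memset( accumulators, 0, sizeof( accumulators ) );

    #pragma omp parallel for schedule(static, 1)
        for( int i = 0; i < Threads; i++ )
        {
            double* const acc = &accumulators[ i * 16 ];
            const size_t first = i * length / Threads;
            const size_t last = ( i + 1 ) * length / Threads;
            for( size_t j = first; j < last; j++ )
                acc[ indices[ j ] ] += rsi[ j ];
        }

        for( int i = 0; i < 16; i += 4 )
        {
            __m256d v = _mm256_load_pd( &accumulators[ i ] );
            for( int j = 1; j < Threads; j++ )
                v = _mm256_add_pd( v, _mm256_load_pd( &accumulators[ i + j * 16 ] ) );
            _mm256_storeu_pd( rdi + i, v );
        }
    }

    // Candidate thread counts to benchmark; keep whichever runs fastest.
    using HistogramFn = void (*)( const double*, const int8_t*, size_t, double* );
    constexpr HistogramFn candidates[] = {
        &computeHistogramOmpT<2>, &computeHistogramOmpT<4>,
        &computeHistogramOmpT<8>, &computeHistogramOmpT<16>,
    };

    Timing each candidate once on the deployment hardware sidesteps guessing at the CPU-specific sweet spot.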