Tags: c++, sum, sse, simd, reduction

SSE reduction of float vector


How can I compute the sum of the elements (a reduction) of a float vector using SSE intrinsics?

Simple serial code:

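// The function name was missing in the original; serial_sum is a placeholder.
void serial_sum(const float *input, float &result, unsigned int NumElems)
{
     result = 0;
     // Use an unsigned index so it matches the type of NumElems.
     for(unsigned int i=0; i<NumElems; ++i)
         result += input[i];
}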

Solution

  • Typically you generate 4 partial sums in your loop and then just sum horizontally across the 4 elements after the loop, e.g.

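    #include <cassert>
    #include <cstdint>
    #include <pmmintrin.h>  // SSE3: declares _mm_hadd_ps
    
    float vsum(const float *a, int n)
    {
        float sum;
        __m128 vsum = _mm_set1_ps(0.0f);        // four partial sums, all zero
        assert((n & 3) == 0);                   // n must be a multiple of 4
        assert(((uintptr_t)a & 15) == 0);       // a must be 16-byte aligned
        for (int i = 0; i < n; i += 4)
        {
            __m128 v = _mm_load_ps(&a[i]);      // aligned load of 4 floats
            vsum = _mm_add_ps(vsum, v);         // accumulate lane by lane
        }
        vsum = _mm_hadd_ps(vsum, vsum);         // (s0+s1, s2+s3, s0+s1, s2+s3)
        vsum = _mm_hadd_ps(vsum, vsum);         // every lane now holds the total
        _mm_store_ss(&sum, vsum);               // write lane 0 to sum
        return sum;
    }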
    

    Note: for the above example, a must be 16-byte aligned and n must be a multiple of 4. If the alignment of a cannot be guaranteed, use _mm_loadu_ps instead of _mm_load_ps. If n is not guaranteed to be a multiple of 4, add a scalar loop at the end of the function to accumulate any remaining elements. Also note that _mm_hadd_ps is an SSE3 intrinsic declared in <pmmintrin.h>, so the code must be built with SSE3 enabled (e.g. -msse3 with GCC or Clang).
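
    The two relaxations mentioned in the note could be sketched roughly as follows: unaligned loads via _mm_loadu_ps, plus a scalar loop for the leftover elements. The function name vsum_any and the size_t length are assumptions made for this illustration, not part of the original answer. The final horizontal sum is also swapped for an SSE1-only shuffle-and-add instead of _mm_hadd_ps, purely so the sketch builds without an SSE3 flag; either approach should give the same kind of result.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE1: everything used below lives here

// Hypothetical helper: sums n floats with no alignment or length requirement.
float vsum_any(const float *a, std::size_t n)
{
    __m128 acc = _mm_setzero_ps();  // four partial sums
    std::size_t i = 0;

    // Vector part: 4 floats per iteration, unaligned loads.
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));

    // Horizontal sum without _mm_hadd_ps (assumption: SSE1 only).
    __m128 hi = _mm_movehl_ps(acc, acc);   // (s2, s3, s2, s3)
    __m128 s  = _mm_add_ps(acc, hi);       // lane 0 = s0+s2, lane 1 = s1+s3
    __m128 l1 = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast lane 1
    s = _mm_add_ss(s, l1);                 // lane 0 = s0+s1+s2+s3
    float sum = _mm_cvtss_f32(s);

    // Scalar tail: the remaining 0 to 3 elements.
    for (; i < n; ++i)
        sum += a[i];
    return sum;
}
```

    Because the partial sums are added in a different order than in the serial loop, the result can differ from the serial version in the last few bits for general floating-point input.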