Suppose I have some very simple code like this:
double array[SIZE_OF_ARRAY];
double sum = 0.0;
for (int i = 0; i < SIZE_OF_ARRAY; ++i)
{
    sum += array[i];
}
I basically want to do the same operations using SSE2. How can I do that?
Here's a very simple SSE3 implementation:
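#include <pmmintrin.h> // SSE3 header, needed for _mm_hadd_pd (also pulls in the SSE2 intrinsics)
__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
    __m128d v = _mm_load_pd(&a[i]); // aligned load of a[i], a[i + 1]
    vsum = _mm_add_pd(vsum, v);     // two running partial sums, one per lane
}
vsum = _mm_hadd_pd(vsum, vsum);     // horizontal add: both lanes now hold the total
double sum = _mm_cvtsd_f64(vsum);   // extract the low lane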
You can unroll the loop to get much better performance by using multiple accumulators to hide the latency of FP addition (as suggested by @Mysticial). Unroll 3 or 4 times with multiple "sum" vectors to bottleneck on load and FP-add throughput (one or two per clock cycle) instead of FP-add latency (one per 3 or 4 cycles):
__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
    __m128d v0 = _mm_load_pd(&a[i]);
    __m128d v1 = _mm_load_pd(&a[i + 2]);
    vsum0 = _mm_add_pd(vsum0, v0);
    vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1); // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0); // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);
Note that the array a is assumed to be 16-byte aligned and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
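If you can't guarantee either assumption, here is a minimal sketch of one way to relax them: use unaligned loads (_mm_loadu_pd) and finish the leftover elements with a scalar loop. The function name sum_any is hypothetical, not something from the question, and I haven't benchmarked it. It uses only SSE2, doing the final lane addition through a small store to memory.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: sums n doubles with no alignment or length requirement. */
static double sum_any(const double *a, size_t n)
{
    __m128d vsum0 = _mm_setzero_pd();
    __m128d vsum1 = _mm_setzero_pd();
    size_t i = 0;

    /* Main loop: 4 doubles per iteration, using unaligned loads. */
    for (; i + 4 <= n; i += 4)
    {
        vsum0 = _mm_add_pd(vsum0, _mm_loadu_pd(&a[i]));
        vsum1 = _mm_add_pd(vsum1, _mm_loadu_pd(&a[i + 2]));
    }
    vsum0 = _mm_add_pd(vsum0, vsum1); /* vertical add down to one accumulator */

    /* Add the two lanes by storing them to memory (plain SSE2). */
    double lanes[2];
    _mm_storeu_pd(lanes, vsum0);
    double sum = lanes[0] + lanes[1];

    /* Scalar cleanup for the 0 to 3 leftover elements. */
    for (; i < n; ++i)
        sum += a[i];

    return sum;
}
```

Keep in mind that the additions happen in a different order than in the scalar loop, so the result can differ from it in the last bits for general floating-point input.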
See also "Fastest way to do horizontal float vector sum on x86" for alternate ways of doing the horizontal sum outside the loop. SSE3 support is not quite universal (AMD CPUs in particular gained it later than Intel's).
Also, _mm_hadd_pd is usually not the fastest way even on CPUs that support it, so an SSE2-only version won't be worse on modern CPUs. It's outside the loop and doesn't make much difference either way, though.
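Since the question asked for SSE2 specifically, here is a sketch of one SSE2-only replacement for the _mm_hadd_pd step. The helper name hsum_pd_sse2 is my own (hypothetical); the intrinsics it uses are all declared in emmintrin.h.

```c
#include <emmintrin.h> /* SSE2 only, no pmmintrin.h needed */

/* Hypothetical helper: returns v[0] + v[1] using only SSE2 instructions. */
static double hsum_pd_sse2(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);      /* both lanes = upper element of v */
    return _mm_cvtsd_f64(_mm_add_sd(v, high)); /* low lane = v[0] + v[1] */
}
```

With this, the last two lines of either loop above would become double sum = hsum_pd_sse2(vsum0); (or vsum for the first version).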