Search code examples
c++ssesimdintrinsicsavx

SIMD: Accumulate Adjacent Pairs


I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.

I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.

template <typename T, typename U>
inline void accumulate_2x2_x_pass(
  T* channel, U* accum,
  const size_t sx, const size_t sy, 
  const size_t osx, const size_t osy,
  const size_t yoff, const size_t oyoff
) {

  const bool odd_x = (sx & 0x01);

  size_t i_idx, o_idx;

  // Should be vectorizable somehow...
  for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
    i_idx = x + yoff;
    o_idx = ox + oyoff;
    accum[o_idx] += channel[i_idx];
    accum[o_idx] += channel[i_idx + 1];
  }

  if (odd_x) {
    // << 1 bc we need to multiply by two on the edge 
    // to avoid darkening during render
    accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
  }
}

However, godbolt shows that my loop is not autovectorizable. (https://godbolt.org/z/qZxvof) How would I construct SIMD intrinsics to solve this issue? I have control of the alignment for accum, but not channel.

(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)

Thanks everyone. :)


Solution

  • The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions are single-uop and do exactly the horizontal widening add you need more efficiently than shuffling.

    If you're horizontally summing across 8 or more u8 elements, use psadbw (sum of absolute differences) against a zeroed vector. _mm_sad_epu8 to get sums at the bottom of each 64-bit element. (A good first step if hsumming a whole vector, or summing a whole u8 array without overflow.)

    If you can do so without losing precision, do the vertical add between rows first, before widening horizontal add. (e.g. 10, 12, or 14-bit pixel components in [u]int16_t can't overflow). Load and vertical-add have 2 per clock throughput (or more) on most CPUs, vs. 1 per clock for pmadd* only having 2-per-clock throughput on Skylake and later. And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add so it's a significant win even on Skylake. (For the 2nd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add before pmadd way, you'll need a pure load first and then fold the 2nd load into add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)

    And ideally you don't need to += into an accumulator array, and instead can just read 2 rows in parallel and accumulator is write-only, so your loop has only 2 input streams and 1 output stream.

    // SSSE3
    __m128i hadd_widen8_to_16(__m128i a) {
                          // uint8_t, int8_t  (doesn't matter when multiplier is +1)
        return _mm_maddubs_epi16(a, _mm_set_epi8(1));
    }
    
    // SSE2
    __m128i hadd_widen16_to_32(__m128i a) {
                       // int16_t, int16_t
        return _mm_madd_epi16(a, _mm_set_epi16(1));
    }
    

    These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.

    Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)


    The T=U case with uint16_t or uint32_t is a a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.

    If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2 per clock shift throughput), even though it costs more 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector if there's no front-end bottleneck.

    // UNTESTED
    
    //Only any good with AVX, otherwise the extra movdqa instructions cost a lot of front-end throughput and ROB capacity
    //Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
    __m128i hadd32_emulated(__m128i a, __m128i b) {
        __m128i a_shift = _mm_srli_epi64(a, 32);
        __m128i b_shift = _mm_srli_epi64(b, 32);
        a = _mm_add_epi32(a, a_shift);
        b = _mm_add_epi32(b, b_shift);
        __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b), _MM_SHUFFLE(2,0,2,0));
        return _mm_castps_si128(combined);
    }
    

    Ice Lake and later Intel, and AMD, have better than 1/clock throughput for some shuffles but not all. For example, vphaddd is 2 uops for p1 or p5 on Ice Lake or later, plus 1 uop for any of p015. But vhaddps's 2 shuffle uops can only run on port 5, even though you can precisely emulate vhaddps with 2x vshufps (with _MM_SHUFFLE(2,0,2,0) for evens and _MM_SHUFFLE(3,1,3,1) for odds, as shown in another answer), and vshufps runs on p1/p5 (unlike vunpcklps for some reason). And AMD CPUs run vhaddps less efficiently than you'd expect as well. So if you have a lot of pairwise summing of FP data, you might consider manually using shuffles that clang won't "optimize" into vhaddps.

    But for integer vphaddd on Ice Lake / Alder Lake and Zen 3 / Zen 4, things seem ok. At least with YMM vectors; Zen 3/4 at least take an extra uop for vphaddd xmm, xmm, xmm according to https://uops.info/ measurements.


    For an AVX2 version you'd need a lane-crossing shuffle to fixup a vphadd result. So emulating hadd with shifts might be a bigger win.

    // 3x shuffle 1x add uops
    __m256i hadd32_avx2(__m256i a, __m256i b) {
        __m256i hadd = _mm256_hadd_epi32(a, b);  // 2x in-lane hadd
        return _mm256_permute4x_epi64( hadd, _MM_SHUFFLE(3,1,2,0) );
    }
    
    // UNTESTED
    // 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
    __m256i hadd32_emulated_avx2(__m256i a, __m256i b)
    {
            __m256i a_shift = _mm256_srli_epi64(a, 32);  // useful result in the low half of each qword
            __m256i b_shift = _mm256_slli_epi64(b, 32);  // ... high half of each qword
            a = _mm256_add_epi32(a, a_shift);
            b = _mm256_add_epi32(b, b_shift);
            __m256i blended = _mm256_blend_epi32(a,b, 0b10101010);  // alternating low/high results
            return _mm256_permutevar8x32_epi32(_mm256_set_epi32(7,5,3,1, 6,4,2,0),  blended);
    }
    

    On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.

    hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.

    (https://agner.org/optimize/, https://uops.info/, and see https://stackoverflow.com/tags/x86/info)

    AVX-512 vpermt2d could let us skip the blend step, or we could do it with merge-masking. And/or we could vprolq to swap 32-bit halves within 64-bit elements to get the same sum in two positions, but we'd still just need a merge-masking add to replace the blend. vpermt2d is probably best; I think we can't avoid a shuffle-control vector, and that lets us avoid setting up any merge-masking constants.