Search code examples
cintrinsicsavxavx2avx512

Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2


I am looking for an optimal method to calculate sum of all packed 32-bit integers in a __m256i or __m512i. To calculate sum of n elements, I ofter use log2(n) vpaddd and vpermd function, then extract the final result. Howerver, it is not the best option I think.

Edit: best/optimal in term of speed/cycle reduction.


Solution

  • Related: if you're looking for the non-existant _mm512_reduce_add_epu8, see Summing 8-bit integers in __m512i with AVX intrinsics vpsadbw as an hsum within qwords is much more efficient than shuffling.

    Without AVX512, see hsum_8x32(__m256i) below for AVX2 without Intel's reduce_add helper function. reduce_add doesn't necessarily compile optimally anyway with AVX512.


    There is a int _mm512_reduce_add_epi32(__m512i) inline function in immintrin.h. You might as well use it. (It compiles to shuffle and add instructions, but more efficient ones than vpermd, like I describe below.) AVX512 didn't introduce any new hardware support for horizontal sums, just this new helper function. It's still something to avoid or sink out of loops whenever possible.

    GCC 9.2 -O3 -march=skylake-avx512 compiles a wrapper that calls it as follows:

            vextracti64x4   ymm1, zmm0, 0x1
            vpaddd  ymm1, ymm1, ymm0
            vextracti64x2   xmm0, ymm1, 0x1   # silly compiler, vextracti128 would be shorter
            vpaddd  xmm1, xmm0, xmm1
            vpshufd xmm0, xmm1, 78
            vpaddd  xmm0, xmm0, xmm1
    
            vmovd   edx, xmm0
            vpextrd eax, xmm0, 1              # 2x xmm->integer to feed scalar add.
            add     eax, edx
            ret
    

    Extracting twice to feed scalar add is questionable; it needs uops for p0 and p5 so it's equivalent to a regular shuffle + a movd.

    Clang doesn't do that; it does one more step of shuffle / SIMD add to reduce down to a single scalar for vmovd. See below for perf analysis of the two.


    There is a VPHADDD but you should never use it with both inputs the same. (Unless you're optimizing for code-size over speed). It can be useful to transpose-and-sum multiple vectors, resulting in some vectors of results. You do that by feeding phadd with 2 different inputs. (Except it gets messy with 256 and 512-bit because vphadd is still only in-lane.)

    Yes, you need log2(vector_width) shuffles and vpaddd instructions. (So this isn't very efficient; avoid horizontal sums inside inner loops. Accumulate vertically until the end of a loop, for example).


    General strategy for all SSE / AVX / AVX512

    You want to successively narrow from 512 -> 256, then 256 -> 128, then shuffle within __m128i until you're down to one scalar element. Presumably some future AMD CPU will decode 512-bit instructions to two 256-bit uops, so reducing width is a big win there. And narrower instructions presumably cost slightly less power.

    Your shuffles can take immediate control operands, not vectors for vpermd. e.g. VEXTRACTI32x8, vextracti128, and vpshufd. (Or vpunpckhqdq to save code size for the immediate constant.)

    See Fastest way to do horizontal SSE vector sum (or other reduction) (my answer also includes some integer versions).

    This general strategy is appropriate for all element types: float, double, and any size integer

    Special cases:

    • 8-bit integer: start with vpsadbw, more efficient and avoids overflow, but then continue as for 64-bit integers.

    • 16-bit integer: start by widening to 32 with pmaddwd (_mm256_madd_epi16 with set1_epi16(1)) : SIMD: Accumulate Adjacent Pairs - fewer uops even if you don't care about the avoiding-overflow benefit, except on AMD before Zen2 where 256-bit instructions cost at least 2 uops. But then you continue as for 32-bit integer.

    32-bit integer can be done manually like this, with an SSE2 function called by the AVX2 function after reducing to __m128i, in turn called by the AVX512 function after reducing to __m256i. The calls will of course inline in practice.

    #include <immintrin.h>
    #include <stdint.h>
    
    // from my earlier answer, with tuning for non-AVX CPUs removed
    // static inline
    uint32_t hsum_epi32_avx(__m128i x)
    {
        __m128i hi64  = _mm_unpackhi_epi64(x, x);           // 3-operand non-destructive AVX lets us save a byte without needing a movdqa
        __m128i sum64 = _mm_add_epi32(hi64, x);
        __m128i hi32  = _mm_shuffle_epi32(sum64, _MM_SHUFFLE(2, 3, 0, 1));    // Swap the low two elements
        __m128i sum32 = _mm_add_epi32(sum64, hi32);
        return _mm_cvtsi128_si32(sum32);       // movd
    }
    
    // only needs AVX2
    uint32_t hsum_8x32(__m256i v)
    {
        __m128i sum128 = _mm_add_epi32( 
                     _mm256_castsi256_si128(v),
                     _mm256_extracti128_si256(v, 1)); // silly GCC uses a longer AXV512VL instruction if AVX512 is enabled :/
        return hsum_epi32_avx(sum128);
    }
    
    // AVX512
    uint32_t hsum_16x32(__m512i v)
    {
        __m256i sum256 = _mm256_add_epi32( 
                     _mm512_castsi512_si256(v),  // low half
                     _mm512_extracti64x4_epi64(v, 1));  // high half.  AVX512F.  32x8 version is AVX512DQ
        return hsum_8x32(sum256);
    }
    

    Notice that this uses __m256i hsum as a building block for __m512i; there's nothing to be gained by doing in-lane operations first.

    Well possibly a very tiny advantage: in-lane shuffles have lower latency than lane-crossing, so they could execute 2 cycles earlier and leave the RS earlier, and similarly retire from the ROB slightly earlier. But the higher-latency shuffles are coming just a couple instructions later even if you did that. So you might get a handful of some independent instructions into the back-end 2 cycles earlier if this hsum was on the critical path (blocking retirement).

    But reducing to a narrower vector width sooner is generally good, maybe getting 512-bit uops out of the system sooner so the CPU can re-activate the SIMD execution units on port 1, if you aren't doing more 512-bit work right away.

    Compiles on Godbolt to these instructions, with GCC9.2 -O3 -march=skylake-avx512

    hsum_16x32(long long __vector(8)):
            vextracti64x4   ymm1, zmm0, 0x1
            vpaddd  ymm0, ymm1, ymm0
            vextracti64x2   xmm1, ymm0, 0x1   # silly compiler uses a longer EVEX instruction when its available (AVX512VL)
            vpaddd  xmm0, xmm0, xmm1
            vpunpckhqdq     xmm1, xmm0, xmm0
            vpaddd  xmm0, xmm0, xmm1
            vpshufd xmm1, xmm0, 177
            vpaddd  xmm0, xmm1, xmm0
            vmovd   eax, xmm0
            ret
    

    P.S.: perf analysis of GCC's _mm512_reduce_add_epi32 vs. clang's (which is equivalent to my version), using data from https://uops.info/ and/or Agner Fog's instruction tables:

    After inlining into a caller that does something with the result, it could allow optimizations like adding a constant as well using lea eax, [rax + rdx + 123] or something.

    But other than that it seems almost always worse than the the shuffle / vpadd / vmovd at the end of my implementation, on Skylake-X:

    • total uops: reduce: 4. Mine: 3
    • ports: reduce: 2p0, p5 (part of vpextrd), p0156 (scalar add)
    • ports: mine: p5, p015 (vpadd on SKX), p0 (vmod)

    Latency is equal at 4 cycles, assuming no resource conflicts:

    • shuffle 1 cycle -> SIMD add 1 cycle -> vmovd 2 cycles
    • vpextrd 3 cycles (in parallel with 2 cycle vmovd) -> add 1 cycle.