Tags: c++, vectorization, x86-64, sse, avx2

Efficient (on Ryzen) way to extract the odd elements of a __m256 into a __m128?


Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is fine.

So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:

// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

// ...

// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x,
  gHigh32Permute)); // This seems to take 3 cycles

Solution

  • That shuffle+cast with _mm256_permutevar8x32_ps is optimal for one vector on Intel and Zen 4 or later. A single one-uop instruction is the best you can get. (It's two uops on AMD Zen 2 and Zen 3, one uop on Zen 4. https://uops.info/)

    Use vpermps instead of vpermd to avoid any risk of an int / FP bypass delay if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to an integer instruction is usually fine on Intel (I'm less sure about feeding the result of an FP instruction to an integer shuffle).
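
    For reference, a minimal sketch of that single-instruction version in the FP domain (the OddElements wrapper name and the __m256d input type are my own, matching the example further down; the index vector is the one from the question):

    #include <immintrin.h>

    // Indices 1,3,5,7 select the high (odd) 32-bit half of each 64-bit element;
    // the upper four indices are don't-cares since we only keep the low 128 bits.
    const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

    inline __m128 OddElements(__m256d x) {
        __m256 xs = _mm256_castpd_ps(x);                               // reinterpret, no instruction
        __m256 packed = _mm256_permutevar8x32_ps(xs, gHigh32Permute);  // vpermps (lane-crossing)
        return _mm256_castps256_ps128(packed);                         // low 128 bits, no instruction
    }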

    If tuning for Intel, you could change the surrounding code so that you shuffle into the bottom 64 bits of each 128-bit lane; that avoids a lane-crossing shuffle entirely. (Then you can just use vshufps ymm, or, if tuning for KNL, vpermilps since two-input vshufps is slower there.)
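
    A hedged sketch of that idea (function names are my own; it assumes the consumer is happy with the odd elements sitting in the low 64 bits of each 128-bit lane, i.e. [1,3,x,x | 5,7,x,x], rather than fully packed into an __m128):

    // In-lane only: one vshufps ymm, no lane-crossing shuffle.
    inline __m256 OddsPerLane(__m256 x) {
        return _mm256_shuffle_ps(x, x, _MM_SHUFFLE(3, 1, 3, 1));
    }

    // Single-input variant (vpermilps), preferable if tuning for KNL.
    inline __m256 OddsPerLaneKNL(__m256 x) {
        return _mm256_permute_ps(x, _MM_SHUFFLE(3, 1, 3, 1));
    }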

    With AVX-512, there's _mm256_cvtepi64_epi32 (vpmovqd), which packs elements across lanes with truncation, keeping the low 32 bits of each 64-bit element.
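
    A hedged AVX-512VL sketch (function names are my own): because of that truncation, vpmovqd directly gives the even (low) halves; for the odd (high) halves, shift right by 32 first.

    // Even (low) 32-bit halves of each 64-bit element: vpmovqd.
    inline __m128i EvenHalves(__m256i x) {
        return _mm256_cvtepi64_epi32(x);
    }

    // Odd (high) 32-bit halves: vpsrlq by 32, then vpmovqd.
    inline __m128i OddHalves(__m256i x) {
        return _mm256_cvtepi64_epi32(_mm256_srli_epi64(x, 32));
    }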


    Lane-crossing shuffles are slow on Zen 1. Agner Fog doesn't have numbers for vpermd, but lists vpermps (which probably uses the same hardware internally) at three uops, five cycles of latency, one per four cycles of throughput. https://uops.info/ confirms those numbers for Zen 1.

    Zen 2 and Zen 3 have 256-bit wide vector execution units for the most part, but their lane-crossing shuffles with elements smaller than 128 bits sometimes take multiple uops. Zen 4 improves things, e.g. vpermps at 0.5-cycle throughput with four cycles of latency.

    vextractf128 xmm, ymm, 1 is very efficient on Zen 1 (1c latency, 0.33c throughput), which is not surprising since Zen 1 tracks 256-bit registers as two 128-bit halves. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128-bit registers into the result you want.

    This also saves you a register for the vpermps shuffle mask you no longer need. (You'd need one vpermps to get the elements you want grouped into the high and low lanes for vextractf128, or, if latency is important, two control vectors for 2x vpermps on CPUs where it's single-uop; see the sketch at the end of this answer.) So for CPUs with multi-uop vpermps, especially Zen 1, I'd suggest:

    __m256d x = /* computed here */;
    
    // Tuned for Zen 1 through Zen 3.  Probably sub-optimal everywhere else.
    __m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));  // vextractf128
    __m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));    // no instructions
    __m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1));  // vshufps xmm: elements 1,3,5,7
    __m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0));  // vshufps xmm: elements 0,2,4,6
    

    On Intel, using three shuffles instead of two reaches two thirds of the optimal throughput, with one extra cycle of latency for the first result.

    On Zen 2 and Zen 3, where vpermps is two uops vs. one for vextractf128, extract + 2x vshufps is better than 2x vpermps.

    Also, the E-cores on Alder Lake have a two-uop vpermps but single-uop vextractf128 and vshufps xmm.
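
    If first-result latency matters more than shuffle throughput, and you're on a CPU where vpermps is a single uop, the 2x vpermps variant mentioned above could look like this (control vectors and names are my own; the two shuffles are independent and can execute in parallel):

    // Two independent lane-crossing shuffles; no vextractf128 on the critical path.
    const __m256i gOddIdx  = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    const __m256i gEvenIdx = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);

    inline void OddEven(__m256 x, __m128 *odd, __m128 *even) {
        *odd  = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(x, gOddIdx));   // vpermps
        *even = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(x, gEvenIdx));  // vpermps
    }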