Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is OK.
So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:
// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
// ...
// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x,
    gHigh32Permute)); // This seems to take 3 cycles
That shuffle+cast with _mm256_permutevar8x32_ps is optimal for one vector on Intel and Zen 2 or later. A single one-uop instruction is the best you can get. (Two uops on AMD Zen 2 and Zen 3, one uop on Zen 4. https://uops.info/)

Use vpermps instead of vpermd to avoid any risk of an int / FP bypass delay if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as input to an integer instruction is usually fine on Intel (I'm less sure about feeding the result of an FP instruction to an integer shuffle).
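For example, a minimal sketch of the FP-domain version, reusing the question's gHigh32Permute constant (the function name is just for illustration, and it assumes the 64-bit elements are really doubles):

#include <immintrin.h>

// Same permute as the question, but kept in the FP domain (vpermps instead of vpermd).
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

__m128 high32_fp(__m256d x) {
    __m256 xf   = _mm256_castpd_ps(x);                           // free: just a reinterpret
    __m256 perm = _mm256_permutevar8x32_ps(xf, gHigh32Permute);  // vpermps: high halves into the low 128 bits
    return _mm256_castps256_ps128(perm);                         // free: take the low half
}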
If tuning for Intel, you can change the surrounding code so that you can shuffle into the bottom 64 bits of each 128-bit lane, avoiding a lane-crossing shuffle. (Then you can just use vshufps ymm, or, if tuning for KNL, vpermilps since 2-input vshufps is slower there.)
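A minimal sketch of that in-lane variant, assuming the surrounding code can consume the two results from separate 128-bit lanes (the function name is hypothetical):

#include <immintrin.h>

// vshufps ymm only: the high 32-bit halves of each 64-bit element land in the
// bottom 64 bits of *each* 128-bit lane, so consumers read the lanes separately.
__m256 high32_per_lane(__m256d x) {
    __m256 xf = _mm256_castpd_ps(x);
    return _mm256_shuffle_ps(xf, xf, _MM_SHUFFLE(3, 1, 3, 1));
}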
With AVX-512, there's _mm256_cvtepi64_epi32 (vpmovqd), which packs elements across lanes, with truncation.
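A usage sketch (needs AVX-512F + AVX-512VL for the 256-bit form; since vpmovqd truncates, i.e. keeps the low halves, shift first if you want the high halves; the function names are illustrative):

#include <immintrin.h>

// AVX-512VL: pack the low 32 bits of each 64-bit element into an XMM register (vpmovqd).
__m128i low32_avx512(__m256i x) {
    return _mm256_cvtepi64_epi32(x);
}

// For the high halves, shift them down first (vpsrlq + vpmovqd).
__m128i high32_avx512(__m256i x) {
    return _mm256_cvtepi64_epi32(_mm256_srli_epi64(x, 32));
}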
Lane-crossing shuffles are slow on Zen 1. Agner Fog doesn't have numbers for vpermd, but he lists vpermps (which probably uses the same hardware internally) at 3 uops, with 5 cycles of latency and one per 4 cycles of throughput. https://uops.info/ confirms those numbers for Zen 1.

Zen 2 and Zen 3 have 256-bit-wide vector execution units for the most part, but their lane-crossing shuffles with elements smaller than 128 bits sometimes take multiple uops. Zen 4 improves things, with vpermps at 0.5c throughput and 4c latency.
vextractf128 xmm, ymm, 1 is very efficient on Zen 1 (1c latency, 0.33c throughput), which is not surprising since it tracks 256-bit registers as two 128-bit halves. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128-bit registers into the result you want.

This also saves you a register for the vpermps shuffle mask you don't need anymore. (One vpermps to get the elements you want grouped into the high and low lanes for vextractf128. Or, if latency is important, two control vectors for 2x vpermps on CPUs where it's single-uop.) So for CPUs with multi-uop vpermps, especially Zen 1, I'd suggest:
__m256d x = /* computed here */;
// Tuned for Zen 1 through Zen 3. Probably sub-optimal everywhere else.
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1)); // vextractf128
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x)); // no instructions
__m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1)); // vshufps xmm: high 32-bit halves
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0)); // vshufps xmm: low 32-bit halves
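For comparison, a sketch of the 2x vpermps alternative mentioned above, with two control vectors (only worth it where vpermps is single-uop; the constants and function name here are illustrative):

#include <immintrin.h>

// Two lane-crossing shuffles with two control vectors: shorter critical path than
// extract + 2x vshufps, but only a win where vpermps is a single uop.
void odd_even_vpermps(__m256d x, __m128 *odd, __m128 *even) {
    const __m256i odd_idx  = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    const __m256i even_idx = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
    __m256 xf = _mm256_castpd_ps(x);
    *odd  = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(xf, odd_idx));
    *even = _mm256_castps256_ps128(_mm256_permutevar8x32_ps(xf, even_idx));
}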
On Intel, using three shuffles instead of two reaches two thirds of the optimal throughput, with one extra cycle of latency for the first result.

On Zen 2 and Zen 3, where vpermps is two uops vs. one for vextractf128, extract + 2x vshufps is better than 2x vpermps.
Also, the E-cores on Alder Lake have two-uop vpermps but one-uop vextractf128 and vshufps xmm.