Is there an AVX or AVX2 operation to convert a __m256i of 16 x 16-bit unsigned integers (uint16_t) to a __m128i of 16 x 8-bit unsigned integers (uint8_t), taking the lower bytes with saturation?
There is _mm256_packus_epi16(), but it uses first 8 bytes from first input, then first 8 bytes from second input, and then second 8 bytes from first and second input... resulting in groups of 8 bytes being out of order.
There are also some AVX512 ops that seem to do what's needed, but I can't depend on AVX512 because it's not present on many target machines.
No, you cannot do that in a single instruction with AVX/AVX2.
There is _mm256_packus_epi16() but it uses first 8 bytes from first input, then first 8 bytes from second input, and then second 8 bytes from first and second input... resulting in groups of 8 bytes being out of order.
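To make the ordering problem concrete, here is a small scalar model of how I understand _mm256_packus_epi16(a, b) to lay out its result. It is only a sketch in plain C: sat_u8 and model_packus_epi16_256 are hypothetical names made up for illustration, not intrinsics. The point is that the pack is done separately within each 128-bit lane.

```c
#include <stdint.h>

/* Saturate a signed 16-bit value to the unsigned 8-bit range [0, 255]. */
static uint8_t sat_u8(int16_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Hypothetical scalar model of _mm256_packus_epi16(a, b):
   each 128-bit lane is packed on its own, so the output is
   a[0..7], b[0..7], a[8..15], b[8..15]. */
static void model_packus_epi16_256(const int16_t a[16],
                                   const int16_t b[16],
                                   uint8_t out[32]) {
    for (int lane = 0; lane < 2; ++lane) {
        for (int i = 0; i < 8; ++i) {
            out[16 * lane + i]     = sat_u8(a[8 * lane + i]);
            out[16 * lane + 8 + i] = sat_u8(b[8 * lane + i]);
        }
    }
}
```

With a holding 0..15 and b holding 100..115, bytes 8..15 of the output come from b rather than from the upper half of a, which is the out-of-order grouping described in the question.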
Here's how you can arrange it properly (AVX2):
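#include <immintrin.h>

static inline __m128i convert(__m256i data) {
    // Split the 256-bit vector into its two 128-bit lanes.
    __m128i lo_lane = _mm256_castsi256_si128(data);       // words 0..7 (free cast)
    __m128i hi_lane = _mm256_extracti128_si256(data, 1);  // words 8..15
    // Pack both lanes to bytes with unsigned saturation, in order.
    // Note: inputs are read as signed 16-bit, so values above 0x7FFF become 0.
    return _mm_packus_epi16(lo_lane, hi_lane);
}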
According to uops.info, on Skylake _mm256_extracti128_si256 is 1 µop on p5 and _mm_packus_epi16 is also 1 µop on p5. Because both compete for the same port, this code should sustain about one conversion every 2 cycles.
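One caveat for genuinely unsigned input: packus treats each 16-bit input as signed, so a uint16_t above 0x7FFF would saturate to 0 rather than 255. A possible workaround, sketched below, is to clamp to 255 with an unsigned minimum before packing. The function name u16x16_to_u8x16_sat, the pointer-based interface and the TARGET_AVX2 macro are my own assumptions for the sake of a self-contained example, and the extra min costs one more instruction.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang: enable AVX2 for this function without needing -mavx2.
   Other compilers (e.g. MSVC with /arch:AVX2) get an empty macro. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Hypothetical helper: convert 16 uint16_t values at src into
   16 uint8_t values at dst, saturating anything above 255 to 255. */
TARGET_AVX2
static void u16x16_to_u8x16_sat(const uint16_t *src, uint8_t *dst) {
    __m256i data = _mm256_loadu_si256((const __m256i *)src);
    /* Unsigned clamp to 255 first, so packus never sees a value
       with the top bit set (which it would read as negative). */
    data = _mm256_min_epu16(data, _mm256_set1_epi16(255));
    __m128i lo_lane = _mm256_castsi256_si128(data);
    __m128i hi_lane = _mm256_extracti128_si256(data, 1);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo_lane, hi_lane));
}
```

If the inputs are known to stay below 0x8000, the clamp is unnecessary and the shorter version is enough.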
You could target plain AVX by using _mm256_extractf128_si256 instead. It may cost additional latency for crossing between the floating-point and integer domains, but throughput should be the same AFAIK.
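For reference, an AVX-only variant might look like the sketch below. Only the extract intrinsic changes; the function name u16x16_to_u8x16_avx, the pointer-based interface and the TARGET_AVX macro are assumptions made to keep the example self-contained.

```c
#include <immintrin.h>
#include <stdint.h>

/* GCC/Clang: enable AVX for this function without needing -mavx.
   Other compilers get an empty macro and need their own AVX switch. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: same packing as the AVX2 version, using only
   AVX and SSE2 instructions. Inputs above 0x7FFF are read as negative
   by packus and therefore come out as 0. */
TARGET_AVX
static void u16x16_to_u8x16_avx(const uint16_t *src, uint8_t *dst) {
    __m256i data = _mm256_loadu_si256((const __m256i *)src);
    __m128i lo_lane = _mm256_castsi256_si128(data);
    /* FP-domain extract: available in AVX, no AVX2 required. */
    __m128i hi_lane = _mm256_extractf128_si256(data, 1);
    _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo_lane, hi_lane));
}
```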