Pack (with saturation) __m256i of 16-bit values to __m128i of 8-bit values?

Is there an AVX or AVX2 operation to convert an __m256i of 16x16-bit unsigned int (uint16_t) values to an __m128i of 16x8-bit unsigned int (uint8_t) values (taking the lower bytes, with saturation)?

There is _mm256_packus_epi16(), but it packs each 128-bit lane separately: the result holds the first 8 values from the first input, then the first 8 values from the second input, and then the second 8 values from the first and second inputs, so the groups of 8 bytes end up out of order.

There are also some AVX512 ops that seem to do what's needed, but I can't depend on AVX512; it's not present on many target machines.

Solution

No, you cannot do that in a single instruction with AVX/AVX2.

As you noticed, _mm256_packus_epi16() works on each 128-bit lane independently, which is why the groups of 8 bytes come out in the wrong order.
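To make the ordering concrete, here is a small scalar model in C of what _mm256_packus_epi16() computes. It is only an illustrative sketch: the names sat_u8 and packus_epi16_256_model are made up for this example and are not part of any library.

```c
#include <stdint.h>

/* Saturate a signed 16-bit value to the unsigned 8-bit range,
   which is what packus does for every element. */
static uint8_t sat_u8(int16_t v) {
  return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Hypothetical scalar model of _mm256_packus_epi16(a, b).
   Each 128-bit lane is packed independently, so the 32 output bytes are:
   a[0..7], b[0..7], a[8..15], b[8..15]. */
static void packus_epi16_256_model(const int16_t a[16], const int16_t b[16],
                                   uint8_t out[32]) {
  for (int lane = 0; lane < 2; lane++) {
    for (int i = 0; i < 8; i++) {
      out[lane * 16 + i]     = sat_u8(a[lane * 8 + i]);  /* low 8 bytes of the lane */
      out[lane * 16 + 8 + i] = sat_u8(b[lane * 8 + i]);  /* high 8 bytes of the lane */
    }
  }
}
```

With a = 0..15 and b = 100..115, the output is 0..7, 100..107, 8..15, 108..115, which is the out-of-order grouping described in the question.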

Here's how you can arrange it properly (AVX2):

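#include <immintrin.h>

// Pack 16x16-bit values to 16x8-bit values with unsigned saturation, keeping element order.
static inline __m128i convert(__m256i data) {
  __m128i lo_lane = _mm256_castsi256_si128(data);       // elements 0..7 (cast only, no instruction)
  __m128i hi_lane = _mm256_extracti128_si256(data, 1);  // elements 8..15
  // Inputs are read as signed 16-bit and saturated to 0..255.
  return _mm_packus_epi16(lo_lane, hi_lane);
}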

According to uops.info, on Skylake _mm256_extracti128_si256 is 1 µop on p5 and _mm_packus_epi16 is 1 µop on p5. Since both compete for port 5, the throughput of this code block should be 2 cycles (one conversion every two cycles).

You could target plain AVX by using _mm256_extractf128_si256 instead. It's possible that it would cost additional latency for domain crossing, but throughput should be the same AFAIK.
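One caveat with convert() above: _mm_packus_epi16 interprets its inputs as signed 16-bit integers, so uint16_t values of 0x8000 and above are seen as negative and saturate to 0 rather than 255. If the inputs can exceed 32767, one option is to clamp them as unsigned values first with _mm256_min_epu16 (AVX2). The sketch below assumes GCC or Clang (for the target attribute; otherwise drop it and compile with -mavx2), and the name pack_u16_to_u8_sat and its pointer-based interface are hypothetical, chosen only for this example.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: saturating pack of 16 uint16_t values to 16 uint8_t
   values, valid for the full 0..65535 input range.
   The target attribute is a GCC/Clang extension (assumption). */
__attribute__((target("avx2")))
static void pack_u16_to_u8_sat(const uint16_t *src, uint8_t *dst) {
  __m256i data = _mm256_loadu_si256((const __m256i *)src);
  /* Clamp as unsigned: anything above 255 becomes 255, which is
     non-negative when packus later reads it as a signed 16-bit value. */
  data = _mm256_min_epu16(data, _mm256_set1_epi16(255));
  __m128i lo_lane = _mm256_castsi256_si128(data);       /* elements 0..7  */
  __m128i hi_lane = _mm256_extracti128_si256(data, 1);  /* elements 8..15 */
  _mm_storeu_si128((__m128i *)dst, _mm_packus_epi16(lo_lane, hi_lane));
}
```

The clamp adds one more instruction per conversion, so it is only worth it when the values are not already known to fit in 15 bits.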
