AVX2 narrowing conversion, from uint16_t to uint8_t

I'd like to narrow a 2d array from 16 to 8 bits, using AVX2. The C++ code that works is as follows:

  auto * s = reinterpret_cast<uint16_t *>(i_frame.Y);
  auto * d = narrowed.data();

  for (auto y = 0; y < i_frame.Height; y++, s += i_frame.Pitch_Luma / 2, d += o_frame.Width)
  {
      for (auto x = 0; x < i_frame.Width; x++)
      {
          d[x] = static_cast<uint8_t>(s[x]);
      }
  }

Then I thought perhaps it would be more efficient to use AVX2 (all our systems have AVX2 support):

 auto * s = reinterpret_cast<uint16_t *>(i_frame.Y);
 auto * d = narrowed.data();

 for (auto y = 0; y < i_frame.Height; ++y, s += i_frame.Pitch_Luma / 2, d += o_frame.Width)
 {
     for (auto x = 0; x < i_frame.Width; x += 16)
     {
         auto src = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x));            
         auto v = _mm256_packus_epi16(src, _mm256_setzero_si256());

         v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));

         _mm_store_si128(reinterpret_cast<__m128i *>(d + x), _mm256_extracti128_si256(v, 0));
     }
 }

The question is whether my AVX2 conversion code is optimal and/or the correct way to do this. I may be missing an AVX2 instruction that makes this very easy; at least, I was with the widening conversion.

Solution

vpackuswb and vpermq are fine for this, but you can arrange things so you get double the work done with those same instructions:

for (size_t x = 0; x < width; x += 32)
{
    auto src1 = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x));
    auto src2 = _mm256_load_si256(reinterpret_cast<const __m256i *>(s + x + 16));
    // sources are known to be in the 0..255 range so no saturation happens
    auto v = _mm256_packus_epi16(src1, src2);

    v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));

    _mm256_store_si256(reinterpret_cast<__m256i *>(d + x), v);
}

This may not be quite a drop-in replacement: the unroll factor changed from 16 to 32 elements per iteration, so it may need additional care near the edge of the image. You may also need an unaligned store if the destination is only 16-byte aligned (or increase the alignment if possible).
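One way to handle the edge is a scalar tail loop after the vector loop. The following is only a sketch under some assumptions: narrow_row_0_255 is a hypothetical helper name, the pixel values are assumed to be in the 0..255 range (as in the comment above), and unaligned loads and stores are used so that nothing is assumed about the alignment of either buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. This lets the function use AVX2 intrinsics even if the
// file isn't compiled with -mavx2; other compilers just need AVX2 enabled.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: narrows one row of `width` pixels, any width.
// Assumes every source value is already in 0..255 (no saturation happens).
AVX2_FN void narrow_row_0_255(const uint16_t* s, uint8_t* d, size_t width)
{
    assert(s != nullptr && d != nullptr);

    size_t x = 0;
    // Main loop: 32 pixels per iteration, only while a full block remains.
    for (; x + 32 <= width; x += 32)
    {
        __m256i src1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + x));
        __m256i src2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + x + 16));
        __m256i v = _mm256_packus_epi16(src1, src2);
        // vpackuswb packs within each 128-bit lane, so fix the qword order.
        v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(d + x), v);
    }
    // Scalar tail for the last (width % 32) pixels.
    for (; x < width; ++x)
    {
        d[x] = static_cast<uint8_t>(s[x]);
    }
}
```

If the rows are padded to a multiple of 32 elements, the tail loop never runs and could be dropped.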

vpackuswb interprets the source data as signed int16_t, and saturates values outside the 0..255 range as it packs down to uint8_t. For inputs that never have the highest bit set (e.g. 10-bit or 12-bit unsigned in uint16_t elements), values above 255 will saturate to 255. But if the high bit is set, as it can be with full-range uint16_t input, the value is treated as signed-negative and saturated to 0. (packs, i.e. vpacksswb, which does signed saturation to the -128 .. +127 range, isn't any more helpful when you want unsigned output.)
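To make the saturation rule concrete, here is a scalar model of what happens to a single 16-bit element. The names packus_model and packus_model_examples are hypothetical and only illustrate the rule described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of vpackuswb for ONE 16-bit element.
// The bits are read as signed int16_t, then clamped to 0..255.
constexpr uint8_t packus_model(uint16_t bits)
{
    return (bits & 0x8000u) ? uint8_t{0}        // sign bit set: negative, clamps to 0
         : (bits > 255u)    ? uint8_t{255}      // too large: clamps to 255
                            : static_cast<uint8_t>(bits);
}

// A few cases from the text, written as checks.
inline void packus_model_examples()
{
    assert(packus_model(200) == 200);     // in range: unchanged
    assert(packus_model(1023) == 255);    // 10-bit maximum: saturates high
    assert(packus_model(0x8000) == 0);    // high bit set: saturates to 0
    assert(packus_model(0xFFFF) == 0);    // -1 as int16_t: saturates to 0
}
```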

To truncate the bit-patterns (modulo instead of saturate), you'd want to mask each input separately before packing, e.g. _mm256_and_si256(src1, _mm256_set1_epi16(0x00FF)) and the same for src2.
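A sketch of the truncating version, which gives the same result as static_cast<uint8_t>(s[x]) for any input. The function name narrow_row_truncate is hypothetical, and to keep the example short it assumes the width is a multiple of 32 and uses unaligned loads and stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. This lets the function use AVX2 intrinsics even if the
// file isn't compiled with -mavx2; other compilers just need AVX2 enabled.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: keeps the low 8 bits of each uint16_t (modulo 256).
AVX2_FN void narrow_row_truncate(const uint16_t* s, uint8_t* d, size_t width)
{
    assert(width % 32 == 0);  // sketch only: no tail handling

    const __m256i low_byte = _mm256_set1_epi16(0x00FF);
    for (size_t x = 0; x < width; x += 32)
    {
        __m256i src1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + x));
        __m256i src2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + x + 16));
        // After masking, every element is in 0..255, so the pack can't saturate.
        src1 = _mm256_and_si256(src1, low_byte);
        src2 = _mm256_and_si256(src2, low_byte);
        __m256i v = _mm256_packus_epi16(src1, src2);
        v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(d + x), v);
    }
}
```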

Or if you want to keep the most-significant 8 bits of each uint16_t, you could shift them like _mm256_srli_epi16(src1, 2) to discard the low 2 bits of 10-bit data and put the rest at the bottom, ready for a saturating pack.
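For the 10-bit case, that could look something like the sketch below. The name narrow_row_10bit is hypothetical, the input is assumed to be 10-bit data (0..1023) in uint16_t elements, and the width is assumed to be a multiple of 32 to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang. This lets the function use AVX2 intrinsics even if the
// file isn't compiled with -mavx2; other compilers just need AVX2 enabled.
#if defined(__GNUC__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: keeps the top 8 bits of 10-bit data.
AVX2_FN void narrow_row_10bit(const uint16_t* s, uint8_t* d, size_t width)
{
    assert(width % 32 == 0);  // sketch only: no tail handling

    for (size_t x = 0; x < width; x += 32)
    {
        __m256i src1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + x));
        __m256i src2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s + x + 16));
        // Drop the low 2 bits; 0..1023 becomes 0..255, so the pack can't saturate.
        src1 = _mm256_srli_epi16(src1, 2);
        src2 = _mm256_srli_epi16(src2, 2);
        __m256i v = _mm256_packus_epi16(src1, src2);
        v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(3, 1, 2, 0));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(d + x), v);
    }
}
```

For 12-bit data the shift count would be 4 instead of 2.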

Shift Right Logical shifts in zeros, so this is usable on full-range uint16_t, where the shift count would be 8.

With a count of 8, it's tempting to use whole-byte tricks instead, like an unaligned load (offset by one byte) so the bytes we want are already at the bottom of each word element, but then we'd have to AND away the unwanted bytes. That could cost fewer uops (vpand can take a memory source operand, while shift-by-immediate can't until AVX-512), and those are non-shuffle uops that can run on more ports. But every other load would be a cache-line split, which may be a worse bottleneck than the front-end.