performance, x86, simd, avx, avx2

Convert 128 bit AVX register with 8-bit elements to two 256 bit registers with 32-bit elements


I am reading 16 bytes of data into a __m128i register and processing them as 8-bit elements.

Later I need to convert the 16x 8-bit elements into 16x 32-bit elements.

Obviously this requires 512 bits of storage. However, I presume it's best to avoid AVX-512, as it will reduce the CPU frequency?

Is it possible to convert the 16x 8-bit elements from a __m128i into two __m256i registers, each containing 8x 32-bit elements?


Solution

  • The obvious way is two vpmovzxbd ymm, xmm instructions (or sx to sign-extend), asm manual entry. Unfortunately vpmovzx can only read from the bottom of a vector, so it takes an extra shuffle to line up the high 8 bytes of your input. And until AVX-512, there are no lane-crossing shuffles with byte granularity. (AVX-512 VBMI vpermb could do it in one instruction, but it would need both a vector constant and a mask constant.)

      __m256i lo = _mm256_cvtepu8_epi32(v);
      __m256i hi = _mm256_cvtepu8_epi32(_mm_unpackhi_epi64(v,v));  // vpunpckhqdq + vpmovzxbd
    

    vpunpckhqdq can run on port 1 or 5 on Ice Lake, so it doesn't compete for the main shuffle unit on port 5 that vpmovzx* ymm, xmm needs. https://uops.info/.

    AMD doesn't handle vpmovzxbd ymm, xmm fully efficiently (as a single uop) until Zen 4. If you're mainly tuning for AMD, it might be worth doing vinserti128 and 2x vpshufb with two different vector constants instead.


    If you had just loaded this __m128i from memory, you could instead do a 128-bit broadcast load, vbroadcasti128 ymm, [mem]. The intrinsics for this are inconsistently and poorly designed: the vbroadcastf128 intrinsics take a __m128 const * or __m128d const * pointer (not float*), but the vbroadcasti128 intrinsic takes a __m128i by value, despite the fact that the instruction only works with a memory source operand. So it requires the compiler to fold a _mm_loadu_si128 into the broadcast; otherwise it may tempt the compiler into spilling/reloading a __m128i, or into using vinserti128 to shuffle a register instead.

       const __m256i vbcst = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i*)addr));
       const __m128i v = _mm256_castsi256_si128(vbcst);  // try to convince the compiler to just use the low half for whatever you need
    
       __m256i lo = _mm256_cvtepu8_epi32(v);       // or with a different vpshufb mask too, if optimizing for AMD
       __m256i hi = _mm256_shuffle_epi8(vbcst, mask);  // in-lane byte shuffle (vpshufb), which can also zero elements; mask picks bytes 8..11 in the low lane, 12..15 in the high lane
    

    "However, I presume it's best to avoid AVX-512 as it will reduce the CPU frequency?"

    The penalty is much smaller on Ice Lake than in previous generations, but any need for a frequency/voltage transition causes a significant stall when it happens. (Then again, if you keep using 512-bit vectors, it stays at the new speed/voltage for the rest of the program.) See SIMD instructions lowering CPU frequency.

    Also, it's only 512-bit vector width that can cause a penalty, not instructions like a zero-masking vpermb ymm that merely require AVX-512 VBMI + VL. (Getting a constant into a mask register would also cost a uop, so it might not save anything unless you're doing this in a loop.)

       __m256i lo = _mm256_cvtepu8_epi32(v);
    #ifdef __AVX512VBMI__      // Ice Lake / Zen 4
       __m256i hi = _mm256_maskz_permutexvar_epi8(0x11111111,
                        _mm256_setr_epi32(8,9,10,11,12,13,14,15),
                        _mm256_castsi128_si256(v));  // vpermb: pick bytes 8..15, zero-mask the other 3 bytes of each dword
    #else
       __m256i hi = _mm256_cvtepu8_epi32(_mm_unpackhi_epi64(v,v));  // vpunpckhqdq + vpmovzxbd
    #endif
    

    If you can't make effective use of 512-bit vectors for most of your program, yes, it's often better to only use 256-bit vectors. (Or if you're sharing a CPU with other work, by other threads or other processes.) But that shouldn't stop you from using AVX-512's new shuffles and features when they're useful with 256-bit vectors, if you can assume they're available. (If you'd have to make a different version of the function and dispatch based on run-time detection of AVX-512, it might not be worth it.)