c++ c intrinsics avx avx512

Difference between _mm256_extractf32x4_ps and _mm256_extractf128_ps


The Intel documentation entries for _mm256_extractf32x4_ps and _mm256_extractf128_ps read very similarly. I could spot only two differences:

  1. _mm256_extractf128_ps takes a const int as parameter, _mm256_extractf32x4_ps takes an int. This should not make any difference.
  2. _mm256_extractf128_ps requires AVX flags, while _mm256_extractf32x4_ps requires AVX512F + AVX512VL, making the former seemingly more portable across CPUs.

What justifies the existence of _mm256_extractf32x4_ps?


Solution

  • Right, the int arg has to become an immediate in both cases, so it needs to be a compile-time constant after constant propagation.

    And yeah, there's no reason to use the non-masked AVX-512VL intrinsic in C; of the AVX-512VL forms, only _mm256_mask_extractf32x4_ps and _mm256_maskz_extractf32x4_ps really make sense to have.

    In asm you might want the AVX-512 version because an EVEX encoding is necessary to access ymm16..31, and of these two instructions only VEXTRACTF32X4 has an EVEX encoding (VEXTRACTF128 is VEX-only). But this is IMO something your C compiler should be able to take care of for you, whichever intrinsic you write.

    If your compiler optimizes intrinsics at all, it will know you're compiling with AVX-512 enabled and will use whatever shuffle allows it to work with the registers it picked during register allocation. (E.g. clang has a very aggressive shuffle optimizer, often using different instructions or turning shuffles into cheaper blends when possible, and sometimes defeating efforts to write smarter code than the shuffle optimizer comes up with.)

    But some compilers (notably MSVC) don't optimize intrinsics, not even doing constant-propagation through them. I think Intel ICC is also like this. (I haven't looked at ICX, their newer clang/LLVM-based compiler.) This model makes it possible to use AVX-512 intrinsics without telling the compiler that it can use AVX-512 instructions on its own. In that case, compiling _mm256_extractf128_ps to VEXTRACTF32X4 to allow usage of YMM16..31 might be a problem (especially if no other AVX-512VL instruction is in the same block, or is otherwise guaranteed to execute whenever this one does).
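
To make the points above concrete, here is a minimal sketch (not a definitive implementation) that calls both intrinsics on the same data and also uses the zero-masking form, which is the one thing the AVX intrinsic cannot express. It assumes GCC or Clang on x86-64: the per-function `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, used here so that AVX-512 intrinsics appear in one function without enabling AVX-512 for the whole file. The function names (`upper_half_vex` and so on) are hypothetical and exist only for illustration.

```cpp
#include <immintrin.h>

// Assumption: GCC or Clang on x86-64. The target attribute enables an ISA
// extension for a single function, so no -mavx / -mavx512vl flag is needed.

// AVX intrinsic. With only AVX enabled in this function, the compiler has to
// use the VEX-encoded VEXTRACTF128.
__attribute__((target("avx")))
static void upper_half_vex_impl(const float* src8, float* dst4) {
    __m256 v = _mm256_loadu_ps(src8);
    _mm_storeu_ps(dst4, _mm256_extractf128_ps(v, 1));   // lane 1 = elements 4..7
}

// AVX-512VL intrinsic without masking: same result as the AVX intrinsic.
__attribute__((target("avx512f,avx512vl")))
static void upper_half_evex_impl(const float* src8, float* dst4) {
    __m256 v = _mm256_loadu_ps(src8);
    _mm_storeu_ps(dst4, _mm256_extractf32x4_ps(v, 1));
}

// Zero-masking form: elements whose mask bit is 0 are zeroed in the result.
__attribute__((target("avx512f,avx512vl")))
static void upper_half_maskz_impl(const float* src8, unsigned char keep,
                                  float* dst4) {
    __m256 v = _mm256_loadu_ps(src8);
    _mm_storeu_ps(dst4,
                  _mm256_maskz_extractf32x4_ps((__mmask8)keep, v, 1));
}

// Runtime-dispatch wrappers: each returns false (leaving dst4 untouched) when
// the CPU lacks the required extension, so no unsupported instruction runs.
bool upper_half_vex(const float* src8, float* dst4) {
    if (!__builtin_cpu_supports("avx")) return false;
    upper_half_vex_impl(src8, dst4);
    return true;
}

bool upper_half_evex(const float* src8, float* dst4) {
    if (!__builtin_cpu_supports("avx512vl")) return false;
    upper_half_evex_impl(src8, dst4);
    return true;
}

bool upper_half_maskz(const float* src8, unsigned char keep, float* dst4) {
    if (!__builtin_cpu_supports("avx512vl")) return false;
    upper_half_maskz_impl(src8, keep, dst4);
    return true;
}
```

For the input {0, 1, 2, 3, 4, 5, 6, 7}, `upper_half_vex` and `upper_half_evex` both store {4, 5, 6, 7}, and `upper_half_maskz` with `keep = 0x5` stores {4, 0, 6, 0}. The runtime check in front of each AVX-512 function guards against the hazard described above: an EVEX-encoded instruction must never execute on a CPU that has only AVX.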