Tags: simd, intrinsics, avx512

Why are there duplicate functions in AVX-512 to set a vector to zero?


I came across these two functions:

  • _mm512_setzero_epi32()
  • _mm512_setzero_si512()

Logically, they do the same thing. I then checked the generated assembly and found it identical under different optimization levels.
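A minimal sketch of that kind of check, assuming GCC or Clang on x86-64 (for the target attribute and __builtin_cpu_supports) and a compiler that provides _mm512_setzero_epi32; the function name setzero_variants_match is hypothetical. It stores both results and compares them byte for byte, against each other and against zero:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper. The target attribute compiles just this function
 * for AVX-512F, so only call it after __builtin_cpu_supports("avx512f")
 * has returned true. */
__attribute__((target("avx512f")))
int setzero_variants_match(void)
{
    static const int32_t zeros[16] = {0};
    int32_t a[16], b[16];

    _mm512_storeu_si512(a, _mm512_setzero_epi32());
    _mm512_storeu_si512(b, _mm512_setzero_si512());

    /* Both results must be identical, and both must be all-zero bits. */
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, zeros, sizeof a) == 0;
}
```

On machines that may lack AVX-512F, guard the call with if (__builtin_cpu_supports("avx512f")).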

It is a simple question: why does AVX-512 have such a duplicated design for setting an integer vector to 0?


Solution

  • _mm512_setzero_epi32() is 100% redundant; there's no reason ever to use it.

    For coding-style reasons, I'd recommend against it. It doesn't follow the _mm_setzero_si128() / _mm256_setzero_si256() naming pattern for returning an all-zero SIMD-integer vector, which _mm512_setzero_si512() does follow.
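    As a sketch of that naming pattern (GCC or Clang assumed; si_family_all_zero is a hypothetical name), each vector width has a setzero intrinsic whose _siN suffix matches its __m128i / __m256i / __m512i type. The function writes all three into one buffer and checks that every byte is zero:

```c
#include <immintrin.h>
#include <string.h>

/* Hypothetical helper. target("avx512f") also enables AVX/AVX2, so all three
 * widths are available here; call only if __builtin_cpu_supports("avx512f"). */
__attribute__((target("avx512f")))
int si_family_all_zero(void)
{
    static const unsigned char zeros[16 + 32 + 64] = {0};
    unsigned char buf[16 + 32 + 64];

    _mm_storeu_si128((__m128i *)buf, _mm_setzero_si128());              /* SSE2, bytes 0..15 */
    _mm256_storeu_si256((__m256i *)(buf + 16), _mm256_setzero_si256()); /* AVX, bytes 16..47 */
    _mm512_storeu_si512(buf + 48, _mm512_setzero_si512());              /* AVX-512F, bytes 48..111 */

    return memcmp(buf, zeros, sizeof buf) == 0;
}
```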

    The situation is very similar to the useless and redundant _mm512_loadu_epi32 (which, confusingly, loads a whole 64-byte vector, not a 4-byte scalar). Not all compilers even support _mm512_loadu_epi32 or _mm512_loadu_epi64, and the same might be true of _mm512_setzero_epi32; that's another reason to avoid it in favour of the more standard and obvious names.
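    To make the "whole 64-byte vector" point concrete, here is a sketch (GCC or Clang assumed; add_one_16 is a hypothetical name) that uses the widely supported _mm512_loadu_si512 rather than _mm512_loadu_epi32. A single load brings in all 16 int32 elements, and all 16 get incremented:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper; call only if __builtin_cpu_supports("avx512f").
 * src and dst must each point to at least 16 int32_t (64 bytes). */
__attribute__((target("avx512f")))
void add_one_16(const int32_t *src, int32_t *dst)
{
    /* One unaligned load of 64 bytes = 16 x int32. On compilers that provide
     * it, _mm512_loadu_epi32(src) would load exactly the same 64 bytes. */
    __m512i v = _mm512_loadu_si512(src);
    v = _mm512_add_epi32(v, _mm512_set1_epi32(1));
    _mm512_storeu_si512(dst, v);
}
```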

    Redundant intrinsics like _mm512_loadu_epi32 and _mm512_and_epi32 are at least part of a pattern with masked versions such as _mm512_maskz_loadu_epi32 and _mm512_mask_loadu_epi32. Masking requires an element size, and having an unmasked intrinsic at least completes a pattern like the one for _mm512_add_epi32, where different element-size versions of the same operation have to exist and there is no _si512 version.
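    A sketch of why masking forces an element size (GCC or Clang assumed; masked_loads is a hypothetical name): the same mask value 0x3 selects two 32-bit elements for the epi32 form, but two 64-bit elements (four int32 values) for the epi64 form. Everything outside the mask is zeroed:

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper; call only if __builtin_cpu_supports("avx512f"). */
__attribute__((target("avx512f")))
void masked_loads(const int32_t src[16], int32_t out32[16], int32_t out64[16])
{
    /* Mask bits 0 and 1 set. Zero-masking: unselected elements become 0. */
    __m512i a = _mm512_maskz_loadu_epi32((__mmask16)0x3, src); /* keeps 2 x 32-bit = 8 bytes */
    __m512i b = _mm512_maskz_loadu_epi64((__mmask8)0x3, src);  /* keeps 2 x 64-bit = 16 bytes */
    _mm512_storeu_si512(out32, a);
    _mm512_storeu_si512(out64, b);
}
```

With src = {1, 2, ..., 16}, out32 starts {1, 2, 0, ...} while out64 starts {1, 2, 3, 4, 0, ...}.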

    But there are no merge-masking or zero-masking setzero intrinsics in the current version of the intrinsics guide. So there's no pattern for setzero_epi32 to be part of.


    In asm, there is no vpxor zmm, only vpxord and vpxorq, because essentially all AVX-512 instructions support masking, which means there has to be an element size. (The same goes for moves like vmovdqa64 / vmovdqa32.)
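    The element size of vpxord vs. vpxorq only becomes visible once a mask is involved; an unmasked v ^ v is all-zero either way. A sketch using the merge-masking intrinsics that correspond to those two instructions (GCC or Clang assumed; masked_xor_zero is a hypothetical name):

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper; call only if __builtin_cpu_supports("avx512f"). */
__attribute__((target("avx512f")))
void masked_xor_zero(const int32_t src[16], int32_t out_d[16], int32_t out_q[16])
{
    __m512i v = _mm512_loadu_si512(src);
    /* Merge-masking: lanes whose mask bit is 0 keep their value from v.
     * With only mask bit 0 set, v ^ v clears a single element. */
    __m512i d = _mm512_mask_xor_epi32(v, (__mmask16)0x1, v, v); /* VPXORD: clears one 32-bit lane */
    __m512i q = _mm512_mask_xor_epi64(v, (__mmask8)0x1, v, v);  /* VPXORQ: clears one 64-bit lane */
    _mm512_storeu_si512(out_d, d);
    _mm512_storeu_si512(out_q, q);
}
```

With src = {1, 2, ..., 16}, out_d starts {0, 2, 3, ...} and out_q starts {0, 0, 3, ...}.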

    So does _mm512_setzero_epi32() imply the use of vpxord? No: Intel's intrinsics guide actually documents it as using vpxorq, like all the other 512-bit zeroing intrinsics, including _mm512_setzero_ps(). (Fun fact: EVEX vxorps requires the AVX512DQ extension, which KNL Xeon Phi doesn't support; only mainstream CPUs, Skylake-avx512 and later, have it.)
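    All of these produce the same all-zero bit pattern, so the instruction used can't change the result. A sketch that checks this (GCC or Clang assumed; setzero_types_same_bits is a hypothetical name) reinterprets the float and double zero vectors as integer vectors with the cast intrinsics, which generate no instructions, and compares all 16 dword lanes:

```c
#include <immintrin.h>

/* Hypothetical helper; call only if __builtin_cpu_supports("avx512f"). */
__attribute__((target("avx512f")))
int setzero_types_same_bits(void)
{
    __m512i zi = _mm512_setzero_si512();
    __m512i zs = _mm512_castps_si512(_mm512_setzero_ps()); /* bit-cast only */
    __m512i zd = _mm512_castpd_si512(_mm512_setzero_pd()); /* bit-cast only */

    /* A result of 0xFFFF means all 16 32-bit lanes compared equal. */
    return _mm512_cmpeq_epi32_mask(zi, zs) == 0xFFFF &&
           _mm512_cmpeq_epi32_mask(zi, zd) == 0xFFFF;
}
```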

    As for which zeroing instruction compilers actually choose, it could be either, and it makes no difference.