Tags: c++, x86, bitwise-operators, intrinsics, avx512

AVX512: perform AND of 512 bits of 8-bit chars


I'd like to AND two 512-bit vectors containing 8-bit elements.

Looking at the Intel Intrinsics Guide I can see some 512-bit AND operations:

__m512i _mm512_and_epi32 (__m512i a, __m512i b)
__m512i _mm512_and_epi64 (__m512i a, __m512i b)

but nothing for epi8 (or epi16).

Is it safe to use the epi64 version? My only hesitation is why Intel provides both epi32 and epi64 variants; if they're interchangeable, presumably everyone could just use epi32. Are there performance reasons?


Solution

  • Both are just simple bitwise AND; you can use either on any data.
    Or better, use _mm512_and_si512 which has the desired semantic meaning.

    In asm, vpandd and vpandq can be used with masking at 32-bit or 64-bit granularity, respectively. Masking is the only reason for having separate opcodes, unlike with AVX2 and earlier where there was just vpand (_mm256_and_si256 and _mm_and_si128).

    Without a mask, there's no significance to the element width. The only reason for _mm512_and_epi32 and epi64 to exist at all is for consistency with _mm512_mask[z]_and_epi[32|64].

    _mm512_and_si512 exists, and will compile to either vpandd or vpandq.
    The intrinsics guide nominally lists it as an intrinsic for vpandd.
    IIRC, most compilers favour the wider element size and will pick vpandq, the same way they use vmovdqa64 for _mm512_load_si512 (unless they fold the AND into a vpternlogq together with other bitwise booleans on the same data).

    AVX512BW added EVEX versions of instructions like vpaddb, where element width matters even without masking. But it didn't add byte- or word-granularity masking for the bitwise booleans, only vmovdqu8 / vmovdqu16 (and vpblendmb / vpblendmw) for separate masked loads, stores, or reg-reg blends (merge-masking or zero-masking).

    For 128- and 256-bit vector widths, hopefully most compilers will use the shorter AVX2 vpand encoding for _mm256_and_si256 if the data is in ymm0-15 (ymm16-31 require the EVEX-encoded vpandd/vpandq).


    All of this applies to the other groups of bitwise intrinsics, which I'll mention here just so search engines can find this:

    • _mm512_andnot_epi32 = _mm512_andnot_epi64 = _mm512_andnot_si512
    • _mm512_or_epi32 = _mm512_or_epi64 = _mm512_or_si512
    • _mm512_xor_epi32 = _mm512_xor_epi64 = _mm512_xor_si512
    • _mm512_ternarylogic_epi32 = _mm512_ternarylogic_epi64
      (there is no si512 version; likewise no si256 / si128 versions of the 128- and 256-bit ternarylogic intrinsics).

    And there is no epi32 or epi64 version of _mm256_and_si256, despite _mm256_mask_and_epi32 / maskz existing, also for epi64 (and likewise for the 128-bit versions). Intel's decisions about which intrinsics to provide and which to omit seem fairly arbitrary.


    Fun fact: vandps/vandpd weren't part of AVX512F (Foundation); only the integer vpandd/vpandq were. The FP versions were added as part of AVX512DQ.
    (Xeon Phi is the only real hardware that has AVX512F without AVX512BW and DQ; fewer redundant opcodes saves transistors in the decoders, I guess, and it presumably didn't care about separate SIMD-integer vs. FP domains for bypass forwarding. AVX-512 was an adaptation of the vector ISA developed for Larrabee and sold commercially in first-gen Xeon Phi, Knights Corner.)