Tags: x86-64, intrinsics, avx512

Packed bit test for __m512


There is no intrinsic for a packed bit test of a whole __m512 vector (i.e. no _mm512_testz_si512 counterpart to _mm256_testz_si256).

What's the best way to do it?


Solution

  • _mm512_test_epi32_mask(v,v) == 0 is the drop-in replacement.

    Test-into-mask and then test the mask to get a scalar bool you can branch on or whatever. The element size of the test doesn't matter if you only care about whether the whole vector has a non-zero bit anywhere, but element sizes of 8/16/32/64 bits are available (asm manual / Intrinsic guide).
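    For instance, a sketch of the byte-granularity flavor (which requires AVX512BW; one mask bit per byte element):

        __mmask64 bytemask = _mm512_test_epi8_mask(v, v);   // 64 bits, one per byte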

    You can also just use the mask as a 0 or non-zero integer if you don't want to branch on it right away and don't need to convert it to a bool, or if you want to know where the set bits are (bit-scan or popcount.) Or use it to zero-mask or merge-mask other AVX-512 operations.

        __mmask16 mask = _mm512_test_epi32_mask(v,v);   // 0 or non-zero integer
        if (mask != 0) {  // __mmask16 is in practice an alias for uint16_t
              // You might have further use for the mask, e.g.
              int first_match_index = std::countr_zero(mask);  // C++20, #include <bit>
        }
    

    In asm, the test/branch or getting a GPR integer could look like this:

      vptestmd  k0, zmm1, zmm1     ; mask of elements where zmm1&zmm1 was non-zero.
    
    ; branch on it.  Or a compiler might use cmovz or setz (create an actual bool)
      kortestw  k0, k0             ; set integer FLAGS according to k0|k0
      jz        vec_was_all_zero   ; branch if ZF==1
    
    ; or get a 0 / non-0  int  you can return, or bit-scan to find the first non-zero element
      kmovw     eax, k0
    

    Or depending on what you want to do with the mask, use _mm512_testn_epi32_mask(v,v) to get NAND instead of AND. testn(v,v) == ~test(v,v). But if you just want to test the mask, you could do _mm512_test_epi32_mask(v,v) == 0xFFFF to check that all 16 elements had a non-zero bit, instead of checking that the testn result was 0. Compilers are bad at this, though: you need to use _kortestc_mask16_u8(msk,msk) (intrinsics guide) instead of msk == 0xFFFF to get compilers to make efficient asm (Godbolt).

    kortest sets the carry flag if the OR result is all-ones, so testing for all-set is as cheap as testing for all-clear, for any mask width. This is efficient even without an immediate operand, unlike AVX2 _mm256_movemask_epi8(v) == -1 where a compiler would cmp eax, -1, which is slightly larger code-size than test eax,eax.

    So choosing between test and testn mostly matters for avoiding an inversion of the mask before countr_zero or whatever; branching can still be done without needing a kmov to a GPR first, unless you leave it up to current compilers.
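    For example, a minimal all-elements-non-zero check via the carry flag (a sketch, assuming an __m512i vector v as above):

        __mmask16 msk = _mm512_test_epi32_mask(v, v);
        if (_kortestc_mask16_u8(msk, msk)) {   // CF=1 iff msk|msk is all-ones
              // every dword element of v had at least one set bit
        }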


    AVX-512 compares and tests are only available with a mask register as a destination (k0-k7), kind of like a compare + vpmovmskb rolled into one single-uop instruction. (The AVX-512 counterparts of _mm256_movemask_epi8 and the ps/pd versions, extracting the high bit of each element, are vpmovd2m (_mm512_movepi32_mask) and so on, available for every element size including 16-bit, e.g. to grab the sign bits of ints or floats.)
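    For instance, a sketch of grabbing the sign bits of 16 floats in a __m512 v (vpmovd2m requires AVX512DQ):

        __mmask16 sign_bits = _mm512_movepi32_mask(_mm512_castps_si512(v));  // vpmovd2m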

    After you get a mask, there are two instructions for setting integer FLAGS conditions based on a k register: kortest (set FLAGS according to a bitwise OR of 2 masks, or a mask with itself), and AVX512DQ/BW ktest (... AND of 2 masks ...).

    So you can actually test two vectors at once for having any non-zero elements, like

     __mmask16 mask1 = _mm512_test_epi32_mask(v1,v1);
     __mmask16 mask2 = _mm512_test_epi32_mask(v2,v2);
     // or any other condition you want to check, like _mm512_cmple_epu32_mask(x,y)
     if (mask1 | mask2) {
        // At least one was non-zero; sort out which if it matters.
        // Or maybe concatenate them (e.g. kunpckwd) and bit-scan the 32-bit mask
        //  to find an element index, maybe into memory they were loaded from
     }
    

    This would compile to 2x vptestmd and 1x kortestw. Same number of uops as vector OR + one vptestmd + kortest in this case; being able to check for any set bits in either of two masks is maybe useful with more complicated compares, like for exact equality.
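    A hand-written sketch of that asm (not actual compiler output) might be:

      vptestmd  k1, zmm0, zmm0     ; mask1
      vptestmd  k2, zmm1, zmm1     ; mask2
      kortestw  k1, k2             ; ZF=1 if (k1|k2) == 0
      jnz       at_least_one_nonzero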


    SSE4 / AVX ptest into integer FLAGS was always 2 uops on mainstream Intel CPUs anyway (https://uops.info/). Intrinsics like _mm256_testz_si256 expose various FLAGS conditions you can check, in this case ZF==1, getting the compiler to emit an instruction like jz, jnz, cmovz ecx, edx, or setz al, depending on how you use the resulting bool.
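    For comparison, the AVX2 whole-vector test looks like this (a sketch, assuming a __m256i vector v):

        if (!_mm256_testz_si256(v, v)) {   // vptest; ZF=0 means some bit was set
              // v had at least one non-zero bit
        }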

    One of the benefits of legacy-SSE ptest (not overwriting a source register) doesn't exist with AVX 3-operand instructions, but it was still occasionally useful for getting the AND or ANDN result when the input vectors weren't compare results or other all-0 / all-1 masks. (compare + ptest + jcc is worse than compare / pmovmskb / macro-fused test+jcc, which is 3 total uops.)

    AVX-512 is heavily designed around per-element masking (so for example, instead of just widening _mm256_xor_si256 to 512, we have _mm512_xor_epi32 or _mm512_xor_epi64 as the no-mask version of _mm512_maskz_xor_epi32). Similarly, the AVX-512 version of ptest is now a per-element thing into a mask register. Other than scalar FP compares into EFLAGS like vucomisd, AVX-512 regularized things so compares/tests always go into mask registers, not EFLAGS like ptest or general-purpose registers like pmovmskb.
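    As a sketch of that masking style (variable names here are hypothetical), using a test result to zero-mask another operation:

        __mmask16 m = _mm512_test_epi32_mask(v, v);
        // keep x^y only in lanes where v was non-zero, zeroing the rest
        __m512i result = _mm512_maskz_xor_epi32(m, x, y);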
