assembly, x86, logic, cpu-architecture, hardware

Does the x86 architecture support packing bools as bits to parallelize logic operations?


Let's say I have 2 arrays of bools of length 64 (or whatever the register size is) and I want to AND all the corresponding bools into a resulting 3rd array. Obviously it's possible to pack the arrays into 2 registers and perform a bitwise AND in a single instruction, but this is much slower if bit fiddling is necessary to pack and unpack. Is there any x86 instruction (or any instruction from an x86 extension) that performs the packing?


Solution

  • You'd normally keep your arrays packed all the time if you wanted to be able to do that efficiently, and access them with bit-indexing within a 64-bit register, e.g. with bt rdi, rax to set CF according to the bit of RDI selected by RAX: bool CF = rdi & (1ULL << (rax & 63)).

    Don't use bt or bts with a memory destination; they have crazy-CISC bit-string semantics: when the destination is memory instead of a register, bt [rdi], rax uses the whole of RAX as a bit-index and can index outside the qword at [rdi].
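
    In C, working with arrays that stay packed is just word-at-a-time AND plus shifts; a minimal sketch (the function and variable names here are my own, not from the question):

    #include <cstdint>
    #include <cstddef>

    // AND two bitmaps stored as arrays of 64-bit words; each bit is one bool.
    void and_bitmaps(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t nwords) {
        for (size_t i = 0; i < nwords; i++)
            dst[i] = a[i] & b[i];            // 64 bools per AND instruction
    }

    // Test bit number  pos  of one 64-bit word, like  bt reg, reg  /  setc
    bool test_bit(uint64_t word, unsigned pos) {
        return (word >> (pos & 63)) & 1;
    }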

    If your arrays are stored 1 bool per byte, you'd normally just use two vpand instructions to bitwise-AND 32 bytes at a time (AVX2). Just like if you were ANDing 256-bit bitmaps where only every 8th bit might be non-zero.

      vmovdqu  ymm0, [rdi]           ; load 32 bytes
      vpand    ymm0, ymm0, [rsi]     ; load+and 32 bytes from the 2nd source
      vmovdqu  [rdx], ymm0           ; store 32 bytes
    
      vmovdqu  ymm0, [rdi+32]         ; and repeat for the next 32 bytes.
      vpand    ymm0, ymm0, [rsi+32]
      vmovdqu  [rdx+32], ymm0
    

    A compiler should do this for you if you write for(int i=0;i<64;i++) c[i] = a[i]&b[i]; for uint8_t or bool elements.
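
    Written out as a complete function (a sketch; the names and the __restrict qualifiers are mine), that's just:

    #include <cstdint>

    // GCC/Clang at -O3 with -mavx2 should turn this loop into vpand on 32-byte chunks.
    void and_bools(uint8_t *__restrict c, const uint8_t *__restrict a,
                   const uint8_t *__restrict b) {
        for (int i = 0; i < 64; i++)
            c[i] = a[i] & b[i];
    }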


    Packing bools to bitmaps with SSE2 or AVX2

    But if you want to pack bools to bitmaps, yeah, pmovmskb is the special x86 instruction you want for this, packing the top bit of each byte of a SIMD vector into an integer. It's existed since SSE2, but AVX2 is fairly widely available and can go 32 bytes at a time instead of just 16.

    See also How to create a byte out of 8 bool values (and vice versa)? for that and a multiply bithack for 8 bytes at a time.
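
    For reference, one form of that multiply bithack looks like this (a sketch based on the linked Q&A; it assumes every byte of the input is strictly 0 or 1):

    #include <cstdint>

    // Pack 8 bool bytes (b[0] in the low byte of x, each byte 0 or 1) into 8 bits,
    // with bit i of the result = byte i of x.  One multiply plus a shift.
    uint8_t pack8(uint64_t x) {
        return (x * 0x0102040810204080ULL) >> 56;
    }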

    e.g. making a std::bitset<64> from a std::array<bool, 64>, using AVX2:

       vmovdqu  ymm0, [rdi]      ; first 32 bool elements
       vpslld   ymm0, ymm0, 7    ; shift the 0/1 to the top, 0x80 or 0x00 in each byte
       vpmovmskb eax, ymm0
    
       vmovdqu  ymm0, [rdi+32]
       vpslld   ymm0, ymm0, 7
       vpmovmskb edx, ymm0
    
         vzeroupper    ; if you might do any legacy SSE before next use of 256-bit vectors
    
       shl       rdx, 32         ; combine hi:lo halves
       or        rax, rdx        ; ((uint64_t)hi << 32) | lo
    # The 64 bits in RAX come from the bools in [rdi+0..63]
    

    So it's more work than just ANDing 32 bytes at a time from two inputs. If you wanted a packed result from two unpacked inputs, you'd probably want to _mm256_and_si256() them and then _mm256_slli_epi32 / _mm256_movemask_epi8 those AND results.
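
    A sketch of that with intrinsics (function and variable names are mine; this assumes 64 bools stored one per byte, each 0 or 1):

    #include <immintrin.h>
    #include <cstdint>

    // AND two arrays of 64 bools (1 per byte) and pack the result into a 64-bit mask.
    uint64_t and_pack64_avx2(const uint8_t *a, const uint8_t *b) {
        __m256i lo = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)a),
                                      _mm256_loadu_si256((const __m256i*)b));
        __m256i hi = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(a + 32)),
                                      _mm256_loadu_si256((const __m256i*)(b + 32)));
        lo = _mm256_slli_epi32(lo, 7);               // move each byte's 0/1 up to its sign bit
        hi = _mm256_slli_epi32(hi, 7);
        uint32_t mlo = _mm256_movemask_epi8(lo);     // vpmovmskb: top bit of each byte
        uint32_t mhi = _mm256_movemask_epi8(hi);
        return ((uint64_t)mhi << 32) | mlo;
    }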

    To unpack again, see How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)? - it's less efficient without AVX-512.

    Using AVX-512

    AVX-512 can compare or test into a mask register, skipping the [v]pmovmskb step. But k0..7 mask registers are limited in what you can do with them (especially if you care about efficiency; kand can only run on port 5 on existing CPUs; https://uops.info/). And it takes a kmov to get data from them into GP registers like RAX.

    For example with intrinsics:

    #include <immintrin.h>
    
    // or I could have declared these as taking  bool *p  args
    __mmask64 foo(char *p){
        __m512i v = _mm512_loadu_si512(p);
        return _mm512_test_epi8_mask(v, v);
    }
    
    __mmask64 bar(char *p){
        __m512i v = _mm512_loadu_si512(p);
        return _mm512_cmpneq_epi8_mask(_mm512_setzero_si512(), v);
    }
    

    These compile on Godbolt to:

    # GCC12 -O3 -march=skylake-avx512
    foo(char*):
            vmovdqu64       zmm0, ZMMWORD PTR [rdi]    # 64-byte load
            vptestmb        k0, zmm0, zmm0             # test into mask
            kmovq   rax, k0
            vzeroupper                  # could have used ZMM16..31 to avoid this
            ret
    
    bar(char*):
            vpxor   xmm0, xmm0, xmm0
            vpcmpb  k0, zmm0, ZMMWORD PTR [rdi], 4
            kmovq   rax, k0
            vzeroupper                    # not actually needed, this version doesn't write a ZMM register
            ret
    

    If I'd used two different input arrays, I could have ANDed them together into a bitmask with a single vptestmb instruction. So it's still better to do that than to separately pack each input and then kand k0, k1.

       vmovdqu32  zmm0, [rdi]
       vptestmb   k1, zmm0, [rsi]          ; k1 = packed bits of a[0..63] & b[0..63]
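
    With intrinsics, that two-input AND+pack is just (a sketch; the function name is mine):

    #include <immintrin.h>

    // AND two 64-byte bool arrays and pack the result straight into a mask register.
    __mmask64 and_pack64_avx512(const char *a, const char *b) {
        __m512i va = _mm512_loadu_si512(a);
        __m512i vb = _mm512_loadu_si512(b);
        return _mm512_test_epi8_mask(va, vb);    // bit i set iff (a[i] & b[i]) != 0
    }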
    

    See Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask? re: vzeroupper being needed or not when you only read a ZMM register after zeroing it implicitly via XMM zeroing. Either way, compilers could have just used ZMM16..31 to avoid touching the upper part of y/zmm0..15. That would avoid transition stalls, and AFAIK there wouldn't be other penalties even though there'd be a non-zero ZMM register for the remainder of the program.

    Using 512-bit vectors can have some performance downsides if you don't make heavy use of them everywhere in your program, which is why compilers default to -mprefer-vector-width=256 for auto-vectorizing.

    If you do compare in two 32-byte halves, you might want kunpckdq k1, k2, k1 after comparing the low half into k1 and the high half into k2, then kmovq rax, k1. That concatenates the low 32 bits of the two masks, with k2 supplying the upper half.
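
    A sketch of that 2x 256-bit version with AVX-512VL+BW intrinsics (names are mine; this packs one array's bools rather than ANDing two):

    #include <immintrin.h>

    // Pack 64 bools (1 per byte) using two 256-bit tests into mask registers,
    // then concatenate the 32-bit halves with kunpckdq.
    __mmask64 pack64_2x256(const char *p) {
        __m256i v0 = _mm256_loadu_si256((const __m256i*)p);
        __m256i v1 = _mm256_loadu_si256((const __m256i*)(p + 32));
        __mmask32 lo = _mm256_test_epi8_mask(v0, v0);    // bit set where byte != 0
        __mmask32 hi = _mm256_test_epi8_mask(v1, v1);
        return _mm512_kunpackd(hi, lo);                  // hi:lo -> 64-bit mask
    }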


    Unpacking

    AVX-512 finally added direct support for turning a mask into a vector of 0 / -1 elements, with vpmovm2b zmm0, k1 (docs). You could vpandd that with a vector of set1_epi8(1) to get bools.
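
    For example (a sketch; the function name is mine):

    #include <immintrin.h>

    // Expand a 64-bit mask back to 64 bools, one per byte.
    __m512i unpack64(__mmask64 m) {
        __m512i v = _mm512_movm_epi8(m);                  // vpmovm2b: 0x00 / 0xFF per byte
        return _mm512_and_si512(v, _mm512_set1_epi8(1));  // vpandd with set1_epi8(1): 0 / 1
    }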

    Otherwise, see