How can I move all the set bits of a mask register to the right (i.e. to the bottom, the least-significant positions)?
For example:
__mmask16 mask = _mm512_cmpeq_epi32_mask(vload, vlimit); // mask = 1101110111011101
If we move all set bits to the right, we will get: 1101110111011101 -> 0000111111111111
How can I achieve this efficiently?
Below you can see how I tried to get the same result, but it's inefficient:
__mmask16 mask = 56797;
// mask: 1101110111011101
__m512i vbrdcast = _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(~0));
// vbrdcast: -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1
__m512i vcompress = _mm512_maskz_compress_epi32(mask, vbrdcast);
// vcompress:-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0
__mmask16 right_packed_mask = _mm512_movepi32_mask(vcompress);
// right_packed_mask: 0000111111111111
What is the best way to do this?
BMI2 pext is the scalar bitwise equivalent of v[p]compressd/q/ps/pd. Use it on your mask value to left-pack the set bits to the bottom of the value.
mask = _pext_u32(-1U, mask); // or _pext_u64(-1ULL, mask64) for __mmask64
// costs 3 asm instructions (kmov + pext + kmov) if you need to use the result as a mask
// not including putting -1 in a register.
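For instance, with the mask value from the question, pext pulls one bit out of the all-ones source for every set bit of the mask and packs those bits at the bottom. A quick worked sketch in the same snippet style (assuming <immintrin.h> and a BMI2 target):
__mmask16 m = 0xDDDD;                  // 1101110111011101  (= 56797, the question's mask)
__mmask16 packed = _pext_u32(-1U, m);  // popcount(0xDDDD) = 12 set bits
                                       // -> packed = 0000111111111111 = 0x0FFF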
Implicit conversion between __mmask16 (aka uint16_t in GCC) and uint32_t works. Use _cvtmask16_u32 and _cvtu32_mask16 to make the KMOVW instructions explicit if you like.
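A minimal sketch of the explicit-conversion version (the function name is just illustrative; it should compile to the same kmov + pext + kmov sequence as the implicit-conversion snippet above):
#include <immintrin.h>

inline __mmask16 leftpack_k_explicit(__mmask16 mask) {
    unsigned m = _cvtmask16_u32(mask);          // kmovw/kmovd r32, k
    return _cvtu32_mask16(_pext_u32(-1U, m));   // pext, then kmovw k, r32
}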
See How to unset N right-most set bits for more about using pext/pdep in ways like this.
All current CPUs with AVX-512 also have fast BMI2 pext (including Xeon Phi), with the same performance as popcnt. AMD had slow pext until Zen 3, but if/when AMD introduces an AVX-512 CPU it should have fast pext/pdep.
For earlier AMD without AVX-512, you might want (1ULL << __builtin_popcount(mask)) - 1, but be careful of overflow if all bits are set: 1ULL << 64 is undefined behaviour, and likely to produce 1 rather than 0 when compiled for x86-64.
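A minimal sketch of that fallback for a 64-bit mask, with the all-bits-set case handled separately so the shift count never reaches 64 (the helper name is just illustrative):
#include <stdint.h>

// No-BMI2 fallback: (1 << popcount) - 1, guarding against a shift by 64
// when every bit of the mask is set.
static inline uint64_t leftpack_popcnt64(uint64_t mask) {
    unsigned n = __builtin_popcountll(mask);    // GCC/clang builtin
    return (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
}
For a 16-bit mask the guard is unnecessary, since the popcount can be at most 16.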
If you were going to use vpcompressd, note that the source vector can simply be all-ones, _mm512_set1_epi32(-1); compress doesn't care about elements where the mask was zero, so they don't need to already be zero.
(It doesn't matter which -1s you pack; once you're working with boolean values, there's no difference between a true that came from your original bitmask and a constant true that was just sitting there, which you generated more cheaply and without a dependency on your input mask. The same reasoning applies to pext, and is why you can use a constant -1U as the source data. A -1 or a set bit has no identity; it's the same as any other -1 or set bit.)
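To make that concrete, here's a small fragment (assuming mask is the __mmask16 from above) comparing the question's broadcast-from-mask source against a plain all-ones constant; both should produce the same packed mask, but only the constant source avoids a dependency chain through the incoming mask:
// The surviving elements are -1 either way, so the resulting packed masks
// are identical; the constant source is just cheaper to produce.
__mmask16 from_broadcast = _mm512_movepi32_mask(
        _mm512_maskz_compress_epi32(mask,
            _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(-1))));
__mmask16 from_allones  = _mm512_movepi32_mask(
        _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1)));
// from_broadcast == from_allones for every mask value.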
So let's try both ways and see how good/bad the asm is.
inline
__mmask16 leftpack_k(__mmask16 mask){
    return _pext_u32(-1U, mask);
}

inline
__mmask16 leftpack_comp(__mmask16 mask) {
    __m512i v = _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1));
    return _mm512_movepi32_mask(v);
}
Looking at stand-alone versions of these isn't useful, because __mmask16 is a typedef for unsigned short and is thus passed/returned in integer registers, not k registers. That makes the pext version look very good, of course, but we want to see how it inlines into a case where we generate and use the mask with AVX-512 intrinsics.
// not a useful function, just something that compiles to asm in an obvious way
void use_leftpack_compress(void *dst, __m512i v){
    __mmask16 m = _mm512_test_epi32_mask(v,v);
    m = leftpack_comp(m);
    _mm512_mask_storeu_epi32(dst, m, v);
}
Commenting out the m = leftpack_comp(m) line, this is just two simple instructions that generate and then use a mask:
use_mask_nocompress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So any extra instructions will be due to left-packing (compressing) the mask. GCC and clang make the same asm as each other, differing only in clang avoiding kmovw in favour of always using kmovd. (Godbolt)
# GCC10.3 -O3 -march=skylake-avx512
use_leftpack_k(void*, long long __vector(8)):
vptestmd k0, zmm0, zmm0
mov eax, -1 # could be hoisted out of a loop
kmovd edx, k0
pext eax, eax, edx
kmovw k1, eax
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
use_leftpack_compress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vpternlogd zmm2, zmm2, zmm2, 0xFF # set1(-1) could be hoisted out of a loop
vpcompressd zmm1{k1}{z}, zmm2
vpmovd2m k1, zmm1
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So the non-hoistable parts are kmov r,k (port 0) / pext (port 1) / kmov k,r (port 5) = 3 uops, one for each execution port. (That includes port 1, whose vector ALUs are shut down while 512-bit uops are in flight but which can still run scalar uops like pext.) The kmov/kmov round trip has 4 cycle latency on SKX, and pext is 3 cycle latency, for a total of 7 cycle latency.
vpcompressd zmm{k}{z}, zmm (2 p5) / vpmovd2m (port 0) = 3 uops, two of them for port 5. vpmovd2m has 3 cycle latency on SKX / ICL, and zero-masking vpcompressd into a zmm has 6 cycle latency from the k input to the zmm output (SKX and ICL). So that's a total of 9 cycle latency, and a worse port distribution for the uops.
Also, the hoistable part is generally worse (vpternlogd is a longer instruction and competes for fewer ports than mov r32, imm32), unless your function already needs an all-ones vector for something but not an all-ones integer register.
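As a sketch of the hoisting point (the loop and function name are hypothetical, modeled on use_leftpack_compress above): in a loop, the mov reg,-1 for the pext version, or the all-ones vector for the compress version, is loop-invariant and stays in a register, so only the per-iteration kmov + pext + kmov cost remains.
#include <immintrin.h>
#include <stddef.h>

// Hypothetical loop: the -1U pext source is loop-invariant, so the compiler
// can hoist the mov reg,-1 and each iteration pays only kmov + pext + kmov.
void masked_store_loop(int *dst, const __m512i *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        __m512i v = _mm512_loadu_si512(&src[i]);
        __mmask16 m = _mm512_test_epi32_mask(v, v);
        m = _cvtu32_mask16(_pext_u32(-1U, _cvtmask16_u32(m)));  // left-pack the mask
        _mm512_mask_storeu_epi32(dst + 16 * i, m, v);           // masked store, as in the example above
    }
}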
Conclusion: the BMI2 pext way is not worse in any way, and better in several. (Unless the surrounding code is heavily bottlenecked on port-1 uops, which is very unlikely when using 512-bit vectors, because in that case port 1 can only be running scalar integer uops like 3-cycle LEA, IMUL, and LZCNT, and of course simple 1-cycle integer stuff like add/sub/and/or.)