How can I move all the set bits of a mask register to the right (i.e. to the bottom, the least-significant positions)?
For example:
__mmask16 mask = _mm512_cmpeq_epi32_mask(vload, vlimit); // mask = 1101110111011101
If we move all set bits to the right, we will get: 1101110111011101 -> 0000111111111111
How can I achieve this efficiently?
Below you can see how I tried to get the same result, but it's inefficient:
__mmask16 mask = 56797;
// mask: 1101110111011101
__m512i vbrdcast = _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(~0));
// vbrdcast: -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1 -1 0 -1 -1
__m512i vcompress = _mm512_maskz_compress_epi32(mask, vbrdcast);
// vcompress:-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 0 0 0 0
__mmask16 right_packed_mask = _mm512_movepi32_mask(vcompress);
// right_packed_mask: 0000111111111111
What is the best way to do this?
BMI2 pext is the scalar bitwise equivalent of v[p]compressd/q/ps/pd. Use it on your mask value to left-pack the set bits to the bottom of the value.
mask = _pext_u32(-1U, mask); // or _pext_u64(-1ULL, mask64) for __mmask64
// costs 3 asm instructions (kmov + pext + kmov) if you need to use the result as a mask
// not including putting -1 in a register.
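For instance, with the mask value from the question, pext pulls one bit out of the all-ones source for every set bit of the mask and packs those bits at the bottom. A quick worked sketch in the same snippet style (assuming <immintrin.h> and a BMI2 target):
__mmask16 m = 0xDDDD;                  // 1101110111011101  (= 56797, the question's mask)
__mmask16 packed = _pext_u32(-1U, m);  // popcount(0xDDDD) = 12 set bits
                                       // -> packed = 0000111111111111 = 0x0FFF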
Implicit conversion between __mmask16 (aka uint16_t in GCC) and uint32_t works. Use _cvtmask16_u32 and _cvtu32_mask16 to make the KMOVW instructions explicit if you like.
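A minimal sketch of the explicit-conversion version (the function name is just illustrative; it should compile to the same kmov + pext + kmov sequence as the implicit-conversion snippet above):
#include <immintrin.h>

inline __mmask16 leftpack_k_explicit(__mmask16 mask) {
    unsigned m = _cvtmask16_u32(mask);          // kmovw/kmovd r32, k
    return _cvtu32_mask16(_pext_u32(-1U, m));   // pext, then kmovw k, r32
}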
See How to unset N right-most set bits for more about using pext/pdep in ways like this.
All current CPUs with AVX-512 also have fast BMI2 pext (including Xeon Phi), with the same performance as popcnt. AMD had slow pext until Zen 3, but if/when AMD introduces an AVX-512 CPU it should have fast pext/pdep.
For earlier AMD without AVX-512, you might want (1ULL << __builtin_popcount(mask)) - 1, but be careful of overflow if all bits are set: 1ULL << 64 is undefined behaviour, and likely to produce 1 rather than 0 when compiled for x86-64.
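A minimal sketch of that fallback for a 64-bit mask, with the all-bits-set case handled separately so the shift count never reaches 64 (the helper name is just illustrative):
#include <stdint.h>

// No-BMI2 fallback: (1 << popcount) - 1, guarding against a shift by 64
// when every bit of the mask is set.
static inline uint64_t leftpack_popcnt64(uint64_t mask) {
    unsigned n = __builtin_popcountll(mask);    // GCC/clang builtin
    return (n >= 64) ? ~0ULL : ((1ULL << n) - 1);
}
For a 16-bit mask the guard is unnecessary, since the popcount can be at most 16.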
If you were going to use vpcompressd, note that the source vector can simply be all-ones, _mm512_set1_epi32(-1); compress doesn't care about elements where the mask was zero, so they don't need to already be zero.
(It doesn't matter which -1s you pack; once you're working with boolean values, there's no difference between a true that came from your original bitmask and a constant true that was just sitting there, which you generated more cheaply and without a dependency on your input mask. The same reasoning applies to pext, and is why you can use a constant -1U as the source data. A -1 or a set bit has no identity; it's the same as any other -1 or set bit.)
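To make that concrete, here's a small fragment (assuming mask is the __mmask16 from above) comparing the question's broadcast-from-mask source against a plain all-ones constant; both should produce the same packed mask, but only the constant source avoids a dependency chain through the incoming mask:
// The surviving elements are -1 either way, so the resulting packed masks
// are identical; the constant source is just cheaper to produce.
__mmask16 from_broadcast = _mm512_movepi32_mask(
        _mm512_maskz_compress_epi32(mask,
            _mm512_maskz_broadcastd_epi32(mask, _mm_set1_epi32(-1))));
__mmask16 from_allones  = _mm512_movepi32_mask(
        _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1)));
// from_broadcast == from_allones for every mask value.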
So let's try both ways and see how good/bad the asm is.
inline
__mmask16 leftpack_k(__mmask16 mask){
    return _pext_u32(-1U, mask);
}

inline
__mmask16 leftpack_comp(__mmask16 mask) {
    __m512i v = _mm512_maskz_compress_epi32(mask, _mm512_set1_epi32(-1));
    return _mm512_movepi32_mask(v);
}
Looking at stand-alone versions of these isn't useful, because __mmask16 is a typedef for unsigned short and is thus passed/returned in integer registers, not k registers. That makes the pext version look very good, of course, but we want to see how it inlines into a case where we generate and use the mask with AVX-512 intrinsics.
// not a useful function, just something that compiles to asm in an obvious way
void use_leftpack_compress(void *dst, __m512i v){
    __mmask16 m = _mm512_test_epi32_mask(v,v);
    m = leftpack_comp(m);
    _mm512_mask_storeu_epi32(dst, m, v);
}
Commenting out the m = leftpack_comp(m) line, this is just two simple instructions that generate and then use a mask:
use_mask_nocompress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So any extra instructions will be due to left-packing (compressing) the mask. GCC and clang make the same asm as each other, differing only in clang avoiding kmovw in favour of always using kmovd. (Godbolt)
# GCC10.3 -O3 -march=skylake-avx512
use_leftpack_k(void*, long long __vector(8)):
vptestmd k0, zmm0, zmm0
mov eax, -1 # could be hoisted out of a loop
kmovd edx, k0
pext eax, eax, edx
kmovw k1, eax
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
use_leftpack_compress(void*, long long __vector(8)):
vptestmd k1, zmm0, zmm0
vpternlogd zmm2, zmm2, zmm2, 0xFF # set1(-1) could be hoisted out of a loop
vpcompressd zmm1{k1}{z}, zmm2
vpmovd2m k1, zmm1
vmovdqu32 ZMMWORD PTR [rdi]{k1}, zmm0
ret
So the non-hoistable parts are kmov r,k (port 0) / pext (port 1) / kmov k,r (port 5) = 3 uops, one for each execution port. (That includes port 1, whose vector ALUs are shut down while 512-bit uops are in flight but which can still run scalar uops like pext.) The kmov/kmov round trip has 4 cycle latency on SKX, and pext is 3 cycle latency, for a total of 7 cycle latency.
vpcompressd zmm{k}{z}, zmm (2 p5) / vpmovd2m (port 0) = 3 uops, two of them for port 5. vpmovd2m has 3 cycle latency on SKX / ICL, and zero-masking vpcompressd into a zmm has 6 cycle latency from the k input to the zmm output (SKX and ICL). So that's a total of 9 cycle latency, and a worse port distribution for the uops.
Also, the hoistable part is generally worse (vpternlogd is a longer instruction and competes for fewer ports than mov r32, imm32), unless your function already needs an all-ones vector for something but not an all-ones integer register.
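As a sketch of the hoisting point (the loop and function name are hypothetical, modeled on use_leftpack_compress above): in a loop, the mov reg,-1 for the pext version, or the all-ones vector for the compress version, is loop-invariant and stays in a register, so only the per-iteration kmov + pext + kmov cost remains.
#include <immintrin.h>
#include <stddef.h>

// Hypothetical loop: the -1U pext source is loop-invariant, so the compiler
// can hoist the mov reg,-1 and each iteration pays only kmov + pext + kmov.
void masked_store_loop(int *dst, const __m512i *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        __m512i v = _mm512_loadu_si512(&src[i]);
        __mmask16 m = _mm512_test_epi32_mask(v, v);
        m = _cvtu32_mask16(_pext_u32(-1U, _cvtmask16_u32(m)));  // left-pack the mask
        _mm512_mask_storeu_epi32(dst + 16 * i, m, v);           // masked store, as in the example above
    }
}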
Conclusion: the BMI2 pext way is not worse in any way, and better in several. (Unless the surrounding code is heavily bottlenecked on port-1 uops, which is very unlikely when using 512-bit vectors, because in that case port 1 can only be running scalar integer uops like 3-cycle LEA, IMUL, and LZCNT, and of course simple 1-cycle integer stuff like add/sub/and/or.)