What is the best way to end up with a `__mmask64` from four `__mmask16`? I just want to concatenate them. Can't seem to find a solution on the internet.
AVX-512 has hardware instructions for concatenating two mask registers: two `kunpckwd` instructions and one `kunpckdq` would do the trick here.

(Each instruction is 4-cycle latency, port 5 only, on SKX and Ice Lake: https://uops.info. At least the two independent `kunpckwd` in the first step can mostly overlap, starting one cycle apart, limited by competition for port 5. And the four masks won't all be ready at once anyway: if the compiler schedules the instructions that generate them sensibly, one pair will be ready first so the first `kunpckwd` can get started early.)
```c
#include <immintrin.h>

// Compiles nicely with GCC/clang/ICC. Current MSVC has major pessimizations.
inline
__mmask64 set_mask64_kunpck(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    __mmask32 md0 = _mm512_kunpackw(m1, m0);   // hi, lo
    __mmask32 md1 = _mm512_kunpackw(m3, m2);
    __mmask64 mq  = _mm512_kunpackd(md1, md0);
    return mq;
}
```
That's your best bet if your `__mmask16` values are actually in `k` registers, which is where a compiler will have them if they're the result of AVX-512 compare/test intrinsics like `_mm512_cmple_epu32_mask`. If they're coming from an array you generated earlier, it might be better to combine them with plain scalar stuff (see Paul's answer, and the sketch below), instead of slowly getting them into mask registers with `kmov`: `kmov k, mem` is 3 front-end uops, with scalar-integer-load and `kmov k, reg` back-end uops, plus an extra front-end uop for no apparent reason.
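For instance, if the four masks happen to be contiguous in memory, the concatenation is just one 64-bit load. A minimal sketch, assuming little-endian x86 layout with element 0 in the low 16 bits (the function name and array argument are my own invention):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Sketch: four contiguous 16-bit masks already form a 64-bit mask in memory.
// On little-endian x86, mask_array[0] ends up in the low 16 bits.
__mmask64 load_mask64(const __mmask16 mask_array[4])
{
    uint64_t combined;
    memcpy(&combined, mask_array, sizeof(combined));  // aliasing-safe 64-bit load
    return (__mmask64)combined;  // compiler can kmovq this into a k register if needed
}
```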
`__mmask16` is just a typedef for `unsigned short` (in gcc/clang/ICC/MSVC) so you can simply manipulate it like an integer, and compilers will use `kmov` as necessary. (This can lead to pretty inefficient code if you're not careful, and unfortunately current compilers aren't smart enough to compile a shift/OR function like the one sketched below into `kunpckwd`.)
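A minimal sketch of such a shift/OR version, matching the commented-out `set_mask64_scalar` call in the test function below (the name comes from that comment; the body is my assumption):

```c
#include <immintrin.h>

// Pure-integer version: correct, but compilers currently emit kmov to GP
// registers plus shifts/ORs instead of spotting the kunpck pattern.
inline
__mmask64 set_mask64_scalar(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    return ((__mmask64)m3 << 48) | ((__mmask64)m2 << 32)
         | ((__mmask64)m1 << 16) |  (__mmask64)m0;
}
```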
There are intrinsics like `unsigned int _cvtmask16_u32(__mmask16 a)`, but they're optional for current compilers that implement `__mmask16` as `unsigned short`.
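If you'd rather not assume the typedef, the explicit conversions look like this (a sketch; on current compilers both compile to nothing beyond a possible `kmov`):

```c
#include <immintrin.h>

unsigned int mask_to_int(__mmask16 m) {
    return _cvtmask16_u32(m);   // same as (unsigned)m where __mmask16 is an integer typedef
}

__mmask16 int_to_mask(unsigned int u) {
    return _cvtu32_mask16(u);   // inverse direction
}
```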
To look at compiler output for a case where `__mmask16` values start out in `k` registers, it's necessary to write a test function that uses intrinsics to create the mask values. (Or to use inline asm constraints.) The standard x86-64 calling conventions handle `__mmask16` as a scalar integer, so as a function arg it's already in an integer register, not a `k` register.
```c
__mmask64 test(__m256i v0, __m256i v1, __m256i v2, __m256i v3)
{
    __mmask16 m0 = _mm256_movepi16_mask(v0);  // clang can optimize _mm_movepi8_mask into pmovmskb eax, xmm, avoiding k regs
    __mmask16 m1 = _mm256_movepi16_mask(v1);
    __mmask16 m2 = _mm256_movepi16_mask(v2);
    __mmask16 m3 = _mm256_movepi16_mask(v3);
    //return set_mask64_mmx(m0,m1,m2,m3);
    //return set_mask64_scalar(m0,m1,m2,m3);
    return set_mask64_kunpck(m0,m1,m2,m3);
}
```
With GCC and clang, that compiles to (Godbolt):
```asm
# gcc 11.1 -O3 -march=skylake-avx512
test(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
        vpmovw2m  k3, ymm0
        vpmovw2m  k1, ymm1
        vpmovw2m  k2, ymm2
        vpmovw2m  k0, ymm3        # create masks
        kunpckwd  k1, k1, k3
        kunpckwd  k0, k0, k2
        kunpckdq  k4, k0, k1      # combine masks
        kmovq     rax, k4         # use mask, in this case by returning as integer
        ret
```
I could have used the final mask result for a blend intrinsic between two of the inputs, for example (as in the sketch below), but the compiler didn't try to avoid `kunpck` by doing 4x `kmov` (which also runs on only one port).
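A sketch of that kind of use, assuming 512-bit vectors of byte elements (the helper name and `v_a`/`v_b` arguments are mine; `_mm512_mask_blend_epi8` requires AVX-512BW):

```c
#include <immintrin.h>

// Use the concatenated mask directly instead of returning it:
// per-byte select between two vectors; a set mask bit takes v_b.
__m512i blend_with_mask64(__m512i v_a, __m512i v_b,
                          __mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
{
    __mmask64 k = set_mask64_kunpck(m0, m1, m2, m3);
    return _mm512_mask_blend_epi8(k, v_a, v_b);
}
```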
MSVC 19.29 `-O2 -Gv -arch:AVX512` does a rather poor job, extracting each mask to a scalar integer register between intrinsics, like:

```asm
; MSVC 19.29
kmovw   ax, k1
movzx   edx, ax
...
kmovd   k3, edx
```
This is supremely dumb: not even using `kmovw eax, k1` to zero-extend into a 32-bit register, not to mention not realizing that the next `kunpck` only cares about the low half of its input anyway, so there was no need to `kmov` the data to/from an integer register at all. Later, it even does this, apparently not realizing that `kmovd` writing a 32-bit register zero-extends into the 64-bit register. (To be fair, GCC has some dumb missed optimizations like that around its `__builtin_popcount` builtin.)
```asm
; MSVC 19.29
kmovd   ecx, k2
mov     ecx, ecx        ; redundant: kmovd already zero-extended into rcx
kmovq   k1, rcx
```
The `kunpck` intrinsics do have strange prototypes, with inputs as wide as their outputs, e.g. `__mmask32 _mm512_kunpackw(__mmask32 a, __mmask32 b)`. So perhaps this is tricking MSVC into manually doing the `uint16_t` -> `uint32_t` conversion by going to scalar and back, since it apparently doesn't know that `vpmovw2m k3, ymm0` already zero-extends into the full `k3`.
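To spell out the semantics the prototype obscures, here's a scalar-equivalent sketch based on the documented `kunpckwd` behavior (only the low 16 bits of each input are read):

```c
#include <immintrin.h>
#include <stdint.h>

// What _mm512_kunpackw actually computes: (a_low16 << 16) | b_low16.
// The upper halves of both 32-bit inputs are ignored, so zero-extending
// them first (as MSVC does) is wasted work.
__mmask32 kunpackw_reference(__mmask32 a, __mmask32 b)
{
    return ((uint32_t)(uint16_t)a << 16) | (uint16_t)b;
}
```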