Tags: x86-64, simd, avx512

Construct a 64-bit mask register from four 16-bit ones


What is the best way to end up with a __mmask64 from four __mmask16? I just want to concatenate them. Can't seem to find a solution on the internet.


Solution

  • AVX-512 has hardware instructions for concatenating two mask registers; for example, 2x kunpckwd instructions and one kunpckdq would do the trick here.

    (Each instruction is 4-cycle latency, port 5 only, on SKX and Ice Lake: https://uops.info. But at least the two independent kunpckwd in the first step can mostly overlap, starting one cycle apart, limited only by competition for port 5. And the four masks won't all be ready at once anyway: if the compiler schedules the instructions that generate them sensibly, one pair will be ready first so its kunpckwd can get started early.)

    // compiles nicely with GCC/clang/ICC.  Current MSVC has major pessimizations
    #include <immintrin.h>   // AVX-512 mask intrinsics
    inline
    __mmask64 set_mask64_kunpck(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
    {
        __mmask32 md0 = _mm512_kunpackw(m1, m0);  // hi, lo
        __mmask32 md1 = _mm512_kunpackw(m3, m2);
        __mmask64 mq = _mm512_kunpackd(md1, md0);
        return mq;
    }
    

    That's your best bet if your __mmask16 values are actually in k registers, where a compiler will have them if they're the result of AVX-512 compare/test intrinsics like _mm512_cmple_epu32_mask. If they're coming from an array you generated earlier, it might be better to combine them with plain scalar stuff (see Paul's answer), instead of slowly getting them into mask registers with kmov: kmov k, mem is 3 uops for the front-end, with a scalar integer load and a kmov k, reg back-end uop, plus an extra front-end uop for no apparent reason.


    __mmask16 is just a typedef for unsigned short (in gcc/clang/ICC/MSVC) so you can simply manipulate it like an integer, and compilers will use kmov as necessary. (This can lead to pretty inefficient code if you're not careful, and unfortunately current compilers aren't smart enough to compile a shift/OR function into using kunpckwd.)

    There are intrinsics like unsigned int _cvtmask16_u32 (__mmask16 a), but they're optional on current compilers, which all implement __mmask16 as plain unsigned short.
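
    As a point of reference, here's a minimal sketch of what such a plain shift/OR function could look like (the name set_mask64_scalar is just for illustration, matching the commented-out call in the test function below). If the masks start out in k registers, expect compilers to emit kmov to integer registers plus scalar shifts/ORs for this, rather than kunpck:

    inline
    __mmask64 set_mask64_scalar(__mmask16 m0, __mmask16 m1, __mmask16 m2, __mmask16 m3)
    {
        // Treat each __mmask16 as the unsigned short it is and build the 64-bit
        // value with ordinary integer widening, shifts and ORs.
        return ((__mmask64)m3 << 48) | ((__mmask64)m2 << 32)
             | ((__mmask64)m1 << 16) |  (__mmask64)m0;
    }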


    To look at compiler output for a case where __mmask16 values start out in k registers, it's necessary to write a test function that uses intrinsics to create the mask values. (Or use inline asm constraints.) The standard x86-64 calling conventions handle __mmask16 as a scalar integer, so as a function arg it's already in an integer register, not a k register.

    __mmask64 test(__m256i v0, __m256i v1, __m256i v2, __m256i v3)
    {
        __mmask16 m0 = _mm256_movepi16_mask(v0);  // clang can optimize _mm_movepi8_mask into pmovmskb eax, xmm avoiding k regs
        __mmask16 m1 = _mm256_movepi16_mask(v1);
        __mmask16 m2 = _mm256_movepi16_mask(v2);
        __mmask16 m3 = _mm256_movepi16_mask(v3);
    
        //return set_mask64_mmx(m0,m1,m2,m3);
        //return set_mask64_scalar(m0,m1,m2,m3);
        return set_mask64_kunpck(m0,m1,m2,m3);
    }
    

    With GCC and clang, that compiles to (Godbolt):

    # gcc 11.1  -O3 -march=skylake-avx512
    test(long long __vector(4), long long __vector(4), long long __vector(4), long long __vector(4)):
            vpmovw2m        k3, ymm0
            vpmovw2m        k1, ymm1
            vpmovw2m        k2, ymm2
            vpmovw2m        k0, ymm3     # create masks
    
            kunpckwd        k1, k1, k3
            kunpckwd        k0, k0, k2
            kunpckdq        k4, k0, k1   # combine masks
    
            kmovq   rax, k4              # use mask, in this case by returning as integer
            ret
    

    I could have used the final mask result for a blend intrinsic between two of the inputs, for example, but the compiler didn't try to avoid kunpck by doing 4x kmov to integer registers and combining there (kmov between k and integer registers also runs on only one port). A sketch of such a blend consumer is below.
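
    For instance, a hedged sketch of what consuming the mask with a blend could look like (test_blend and its extra a/b inputs are made up for illustration; _mm512_mask_blend_epi8 takes a __mmask64 directly):

    __m512i test_blend(__m512i a, __m512i b,
                       __m256i v0, __m256i v1, __m256i v2, __m256i v3)
    {
        __mmask64 m = set_mask64_kunpck(_mm256_movepi16_mask(v0),
                                        _mm256_movepi16_mask(v1),
                                        _mm256_movepi16_mask(v2),
                                        _mm256_movepi16_mask(v3));
        return _mm512_mask_blend_epi8(m, a, b);   // vpblendmb: takes bytes from b where the mask bit is set
    }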

    MSVC 19.29 -O2 -Gv -arch:AVX512 does a rather poor job, extracting each mask to a scalar integer register between intrinsics, like:

    ; MSVC 19.29
            kmovw   ax, k1
            movzx   edx, ax
            ...
            kmovd   k3, edx
    

    This is supremely dumb: it doesn't even use kmovw eax, k1 to zero-extend into a 32-bit register, let alone realize that the next kunpck only cares about the low part of its input anyway, so there was no need to kmov the data to/from an integer register at all. Later, it even emits the following, apparently not realizing that kmovd writing a 32-bit register already zero-extends into the 64-bit register. (To be fair, GCC has some dumb missed optimizations like that around its __builtin_popcount built-in.)

    ; MSVC 19.29
            kmovd   ecx, k2
            mov     ecx, ecx
            kmovq   k1, rcx
    

    The kunpck intrinsics do have strange prototypes, with inputs as wide as their outputs, e.g.

    __mmask32 _mm512_kunpackw (__mmask32 a, __mmask32 b)
    

    So perhaps this is tricking MSVC into manually doing the uint16_t -> uint32_t conversion by going to scalar and back, since it apparently doesn't know that vpmovw2m k3, ymm0 already zero-extends into the full k3.
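
    A quick standalone check (assuming AVX-512BW is enabled) that only the low halves of the wide inputs matter:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        // _mm512_kunpackw(hi, lo): the low 16 bits of 'hi' become the high half of
        // the result, the low 16 bits of 'lo' the low half; upper halves are ignored.
        __mmask32 r = _mm512_kunpackw(0xFFFF0001u, 0xFFFF0002u);
        printf("%#010x\n", (unsigned)r);   // prints 0x00010002
        return 0;
    }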