Search code examples
c++visual-c++intrinsicsmicro-optimizationavx512

How to instruct MS Visual C++ compiler to use an uninitialized __m512i register


How to instruct the Visual C++ compiler (1926) to use an uninitialized __m512i register. In the following code snippet a not(or(A,B)) is calculated, the content of dummy is irrelevant.

__m512i dummy;
const __m512i n8 = _mm512_ternarylogic_epi64(dummy, A, B, 0x11);

Somehow the compiler assumes the register needs to have some content, (it does not), and an expensive and unnecessary memory reference is generated for zmm0:

62 F1 7E 48 6F 45 00 vmovdqu32   zmm0,zmmword ptr [rbp]  
62 F3 DD 48 25 C5 11 vpternlogq  zmm0,zmm4,zmm5,11h  

ICC 19.0.1 understands the this situation and does not generate the vmovdqu32.

What have I tried: initializing dummy with 0 replaces the vmovdqu32 with:

C5 F1 EF C9          vpxor       xmm1,xmm1,xmm1

This still gives an unnecessary instruction and a stall.

Thus the question: how to instruct the Visual C++ compiler to do the same as the Intel compiler? Just do not initialize the dummy register.


Solution

  • and a stall

    xor-zeroing is dependency breaking. It's also literally as cheap as a NOP on current Intel CPUs, and avoids the risk of an output dependency coupling this dep chain into another one. It won't cause a stall (except indirectly, like from an I-cache miss), but it is a potential waste of one fused-domain uop of front-end throughput.


    If A or B are dead after this, use one of them as the dummy input, like this

    __m512i nor_A(__m512i A, __m512i B) {
        return _mm512_ternarylogic_epi64(A, A, B, 0x11);
    }
    

    When not inlined, so the input regs are dead afterward, and it has to return in the same reg it received A in, all 4 major x86 compilers make ideal code for this simple case. (Some optimize the immediate to 5 instead of 0x11, I guess using the first input.)

    ; MSVC 19.24 -O2 -arch:AVX512 -Gv    (vectorcall calling convention)
    # gcc10/clang10/ICC19 -O3 -march=skylake-avx512
    nor_A:
            vpternlogq      zmm0, zmm0, zmm1, 17
            ret
    

    Or if you're using this in a loop, you could intentionally create a loop-carried dep chain by using the destination as the first input. Declare the vector outside the loop. If you're using ternlog inside a wrapper function, you'd need to pass a reference to the vector into that function to make this work.


    If you want to risk a false dependency, _mm512_undefined_epi32() is your best hope for what you want. It safely expresses what you want (an arbitrary register) while avoiding Undefined Behaviour from reading an uninitialized C variable. (And no, IDK why Intel thought epi32 would make more sense than si512 like _mm_undefined_si128(). There isn't a masked version of it!)

    ICC compiles it to zero extra instructions. Clang, GCC and MSVC do xor-zero a destination register, though, perhaps implementing it as _mm512_setzero_si512 if they don't really support undefined inputs in their internals. Godbolt

    I also included versions with actual UB; ICC and clang do what you wanted there, picking zmm0 as the dummy input.

    __m512i nor_undef(__m512i A, __m512i B) {
        return _mm512_ternarylogic_epi64(_mm512_undefined_epi32(), A, B, 0x11);
    }
    

    MSVC 19.24 -O2 -arch:AVX512 -Gv - not great, but basically fine, so the same source can compile to what you want for ICC without being terrible anywhere.

    __m512i nor_undef(__m512i,__m512i) PROC             ; nor_undef, COMDAT
        vpxor   xmm2, xmm2, xmm2
        vpternlogq zmm2, zmm0, zmm1, 17
        vmovdqu32 zmm0, zmm2
        ret     0
    

    GCC 10.1:

    nor_undef:
        vmovdqa64       zmm2, zmm0
        vpxor   xmm0, xmm0, xmm0
        vpternlogq      zmm0, zmm2, zmm1, 17
        ret
    

    Clang 10.0

    nor_undef:
        vpxor   xmm2, xmm2, xmm2
        vpternlogq      zmm0, zmm2, zmm1, 5
        ret
    

    ICC 19.0.1

    nor_undef:
        vpternlogq zmm0, zmm2, zmm1, 5                          #15.12
        ret                                                     #15.12