How to instruct the Visual C++ compiler (1926) to use an uninitialized __m512i
register. In the following code snippet a not(or(A,B))
is calculated, the content of dummy
is irrelevant.
__m512i dummy;
const __m512i n8 = _mm512_ternarylogic_epi64(dummy, A, B, 0x11);
Somehow the compiler assumes the register needs to have some content, (it does not), and an expensive and unnecessary memory reference is generated for zmm0
:
62 F1 7E 48 6F 45 00 vmovdqu32 zmm0,zmmword ptr [rbp]
62 F3 DD 48 25 C5 11 vpternlogq zmm0,zmm4,zmm5,11h
ICC 19.0.1 understands the this situation and does not generate the vmovdqu32
.
What have I tried: initializing dummy
with 0 replaces the vmovdqu32
with:
C5 F1 EF C9 vpxor xmm1,xmm1,xmm1
This still gives an unnecessary instruction and a stall.
Thus the question: how to instruct the Visual C++ compiler to do the same as the Intel compiler? Just do not initialize the dummy register.
and a stall
xor-zeroing is dependency breaking. It's also literally as cheap as a NOP on current Intel CPUs, and avoids the risk of an output dependency coupling this dep chain into another one. It won't cause a stall (except indirectly, like from an I-cache miss), but it is a potential waste of one fused-domain uop of front-end throughput.
If A
or B
are dead after this, use one of them as the dummy input, like this
__m512i nor_A(__m512i A, __m512i B) {
return _mm512_ternarylogic_epi64(A, A, B, 0x11);
}
When not inlined, so the input regs are dead afterward, and it has to return in the same reg it received A
in, all 4 major x86 compilers make ideal code for this simple case. (Some optimize the immediate to 5
instead of 0x11
, I guess using the first input.)
; MSVC 19.24 -O2 -arch:AVX512 -Gv (vectorcall calling convention)
# gcc10/clang10/ICC19 -O3 -march=skylake-avx512
nor_A:
vpternlogq zmm0, zmm0, zmm1, 17
ret
Or if you're using this in a loop, you could intentionally create a loop-carried dep chain by using the destination as the first input. Declare the vector outside the loop. If you're using ternlog inside a wrapper function, you'd need to pass a reference to the vector into that function to make this work.
If you want to risk a false dependency, _mm512_undefined_epi32()
is your best hope for what you want. It safely expresses what you want (an arbitrary register) while avoiding Undefined Behaviour from reading an uninitialized C variable. (And no, IDK why Intel thought epi32
would make more sense than si512
like _mm_undefined_si128()
. There isn't a masked version of it!)
ICC compiles it to zero extra instructions. Clang, GCC and MSVC do xor-zero a destination register, though, perhaps implementing it as _mm512_setzero_si512
if they don't really support undefined inputs in their internals. Godbolt
I also included versions with actual UB; ICC and clang do what you wanted there, picking zmm0
as the dummy input.
__m512i nor_undef(__m512i A, __m512i B) {
return _mm512_ternarylogic_epi64(_mm512_undefined_epi32(), A, B, 0x11);
}
MSVC 19.24 -O2 -arch:AVX512 -Gv
- not great, but basically fine, so the same source can compile to what you want for ICC without being terrible anywhere.
__m512i nor_undef(__m512i,__m512i) PROC ; nor_undef, COMDAT
vpxor xmm2, xmm2, xmm2
vpternlogq zmm2, zmm0, zmm1, 17
vmovdqu32 zmm0, zmm2
ret 0
GCC 10.1:
nor_undef:
vmovdqa64 zmm2, zmm0
vpxor xmm0, xmm0, xmm0
vpternlogq zmm0, zmm2, zmm1, 17
ret
Clang 10.0
nor_undef:
vpxor xmm2, xmm2, xmm2
vpternlogq zmm0, zmm2, zmm1, 5
ret
ICC 19.0.1
nor_undef:
vpternlogq zmm0, zmm2, zmm1, 5 #15.12
ret #15.12