What is the /d2vzeroupper
MSVC compiler optimization flag doing?
I was reading through this Compiler Options Quick Reference Guide for Epyc CPUs from AMD: https://developer.amd.com/wordpress/media/2020/04/Compiler%20Options%20Quick%20Ref%20Guide%20for%20AMD%20EPYC%207xx2%20Series%20Processors.pdf
For MSVC, to "Optimize for 64-bit AMD processors", they recommend to enable /favor:AMD64 /d2vzeroupper
.
What /favor:AMD64
is doing is clear, there is documentation about that in the MSVC docs. But I can't seem to find /d2vzeroupper
being mentioned anywhere in the internet at all, no documentation anywhere. What is it doing?
TL;DR: When using /favor:AMD64
add /d2vzeroupper
to avoid very poor performance of SSE code on both current AMD CPUs and Intel CPUs.
Generally /d1...
and /d2...
are "secret" (undocumented) MSVC options to tune compiler behavior. /d1...
apply to complier front-end, /d2...
apply to compiler back-end.
/d2vzeroupper
enables compiler-generated vzeroupper
instruction
See Do I need to use _mm256_zeroupper in 2021? for more information.
Normally it is by default. You can disable it by /d2vzeroupper-
. See here: https://godbolt.org/z/P48crzTrb
/favor:AMD64
switch suppresses vzeroupper
, so /d2vzeroupper
enables it back.
The up-to-date Visual Studio 2022 has fixed that, so /favor:AMD64
still emits vzeroupper
and /d2vzeroupper
is not needed to enable it.
Reason: current AMD optimization guides (available from AMD site; direct pdf link) suggest:
2.11.6 Mixing AVX and SSE
There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data. Transitioning in either direction will cause a micro-fault to spill or fill the upper 128 bits of all 16 YMM registers. There will be an approximately 100 cycle penalty to signal and handle this fault. To avoid this penalty, a VZEROUPPER or VZEROALL instruction should be used to clear the upper 128 bits of all YMM registers when transitioning from AVX code to SSE or unknown code
Older AMD processor did not need vzeroupper
, so /favor:AMD64
implemented optimization for them, even though penalizing Intel CPUs. From MS docs:
/favor:AMD64
(x64 only) optimizes the generated code for the AMD Opteron, and Athlon processors that support 64-bit extensions. The optimized code can run on all x64 compatible platforms. Code that is generated by using /favor:AMD64 might cause worse performance on Intel processors that support Intel64.