Tags: c++, visual-c++, avx, compiler-flags, amd-processor

What is the /d2vzeroupper MSVC compiler optimization flag doing?



I was reading through this Compiler Options Quick Reference Guide for Epyc CPUs from AMD: https://developer.amd.com/wordpress/media/2020/04/Compiler%20Options%20Quick%20Ref%20Guide%20for%20AMD%20EPYC%207xx2%20Series%20Processors.pdf

For MSVC, to "Optimize for 64-bit AMD processors", they recommend enabling /favor:AMD64 /d2vzeroupper.

What /favor:AMD64 does is clear, since it is covered in the MSVC documentation. But I can't find /d2vzeroupper mentioned anywhere on the internet, and there is no documentation for it. What does it do?


Solution

  • TL;DR: When using /favor:AMD64, add /d2vzeroupper to avoid very poor performance of SSE code on both current AMD and Intel CPUs.


    Generally, /d1... and /d2... are "secret" (undocumented) MSVC options that tune compiler behavior: /d1... options apply to the compiler front end, /d2... options apply to the compiler back end.


    /d2vzeroupper enables compiler-generated vzeroupper instructions.

    See Do I need to use _mm256_zeroupper in 2021? for more information.

    Normally it is enabled by default. You can disable it with /d2vzeroupper-. See here: https://godbolt.org/z/P48crzTrb
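
    For illustration, here is a minimal sketch of the kind of code this affects (the function name and build flags are my own example, not taken from the Godbolt link): when a function using 256-bit AVX intrinsics is compiled with /arch:AVX2, the compiler normally emits vzeroupper before the function returns, and /d2vzeroupper- removes it; compare the disassembly with and without the flag.

        #include <immintrin.h>

        // Sums 8 floats using a 256-bit YMM register. Built with /arch:AVX2,
        // MSVC normally appends a vzeroupper before the return; building with
        // /d2vzeroupper- suppresses it (visible in the generated assembly).
        float sum8(const float* p) {
            __m256 v  = _mm256_loadu_ps(p);             // 8 floats into a YMM register
            __m128 lo = _mm256_castps256_ps128(v);      // lower 128 bits
            __m128 hi = _mm256_extractf128_ps(v, 1);    // upper 128 bits
            __m128 s  = _mm_add_ps(lo, hi);
            s = _mm_hadd_ps(s, s);                      // horizontal adds (SSE3)
            s = _mm_hadd_ps(s, s);
            return _mm_cvtss_f32(s);
        }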

    The /favor:AMD64 switch suppresses vzeroupper, so /d2vzeroupper enables it again.

    Up-to-date Visual Studio 2022 has fixed that: /favor:AMD64 still emits vzeroupper, so /d2vzeroupper is no longer needed to enable it.
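
    For older toolsets that still suppress it, the AMD-recommended flags would be combined on the command line roughly like this (example.cpp is a placeholder; /arch:AVX2 is shown only because vzeroupper matters once the compiler emits AVX code):

        cl /O2 /arch:AVX2 /favor:AMD64 /d2vzeroupper example.cpp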


    Reason: current AMD optimization guides (available from the AMD site; direct pdf link) suggest:

    2.11.6 Mixing AVX and SSE

    There is a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data. Transitioning in either direction will cause a micro-fault to spill or fill the upper 128 bits of all 16 YMM registers. There will be an approximately 100 cycle penalty to signal and handle this fault. To avoid this penalty, a VZEROUPPER or VZEROALL instruction should be used to clear the upper 128 bits of all YMM registers when transitioning from AVX code to SSE or unknown code.
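
    In hand-written intrinsics code the same clearing can be done explicitly. The sketch below (function names are hypothetical) dirties the YMM upper halves with AVX work and then calls _mm256_zeroupper() before handing off to SSE-only code, which is what a compiler-generated vzeroupper automates:

        #include <immintrin.h>
        #include <cstddef>

        // Hypothetical legacy routine built without AVX: plain SSE adds.
        static void legacy_sse_add(float* dst, const float* src, std::size_t n) {
            for (std::size_t i = 0; i + 4 <= n; i += 4) {
                __m128 a = _mm_loadu_ps(dst + i);
                __m128 b = _mm_loadu_ps(src + i);
                _mm_storeu_ps(dst + i, _mm_add_ps(a, b));
            }
        }

        void process(float* dst, const float* src, std::size_t n) {
            // 256-bit AVX work leaves non-zero data in the upper halves of the YMM registers.
            for (std::size_t i = 0; i + 8 <= n; i += 8) {
                __m256 v = _mm256_loadu_ps(src + i);
                _mm256_storeu_ps(dst + i, _mm256_mul_ps(v, v));
            }
            // Clear the upper 128 bits before running SSE/unknown code, as the guide
            // advises; /d2vzeroupper makes the compiler insert this automatically.
            _mm256_zeroupper();
            legacy_sse_add(dst, src, n);
        }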

    Older AMD processors did not need vzeroupper, so /favor:AMD64 implemented an optimization for them, even though it penalized Intel CPUs. From the MS docs:

    /favor:AMD64

    (x64 only) optimizes the generated code for the AMD Opteron, and Athlon processors that support 64-bit extensions. The optimized code can run on all x64 compatible platforms. Code that is generated by using /favor:AMD64 might cause worse performance on Intel processors that support Intel64.