Tags: c++, sse, simd, intrinsics, avx

Do I need to use _mm256_zeroupper in 2021?


From Agner Fog's "Optimizing software in C++":

There is a problem when mixing code compiled with and without AVX support on some Intel processors. There is a performance penalty when going from AVX code to non-AVX code because of a change in the YMM register state. This penalty should be avoided by calling the intrinsic function _mm256_zeroupper() before any transition from AVX code to non-AVX code. This can be necessary in the following cases:

• If part of a program is compiled with AVX support and another part of the program is compiled without AVX support then call _mm256_zeroupper() before leaving the AVX part.

• If a function is compiled in multiple versions with and without AVX using CPU dispatching then call _mm256_zeroupper() before leaving the AVX part.

• If a piece of code compiled with AVX support calls a function in a library other than the library that comes with the compiler, and the library has no AVX support, then call _mm256_zeroupper() before calling the library function.
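Agner's first bullet describes something like the following sketch (the function name is hypothetical): an AVX-compiled function that manually cleans the uppers before returning to code that may have been compiled without AVX. Note that modern compilers will usually emit the vzeroupper here on their own.

```cpp
#include <immintrin.h>

// Hypothetical AVX part of a program; the rest might be compiled without
// AVX support. Agner's advice: clean the uppers manually before leaving.
__attribute__((target("avx")))
void avx_part(float *dst, const float *src)
{
    __m256 v = _mm256_loadu_ps(src);
    _mm256_storeu_ps(dst, _mm256_add_ps(v, v));   // dirties a YMM upper half
    _mm256_zeroupper();   // leave YMM uppers clean for non-AVX code
}
```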

I'm wondering which Intel processors those are. Specifically, are there any made in the last five years? That way I'll know whether it's too late to fix missing _mm256_zeroupper() calls or not.


Solution

  • TL:DR: Don't use the _mm256_zeroupper() intrinsic manually, compilers understand SSE/AVX transition stuff and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)
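    As a minimal sketch of what that means in practice (the function name is hypothetical): the following horizontal sum uses AVX intrinsics with no manual _mm256_zeroupper() anywhere. Compiled with gcc or clang, the function dirties a YMM upper and the compiler inserts a vzeroupper before `ret` on its own.

```cpp
#include <immintrin.h>

// Hypothetical example: horizontal sum of 8 floats. No manual
// _mm256_zeroupper(); the compiler emits vzeroupper before returning,
// since the caller might run legacy-SSE code afterwards.
__attribute__((target("avx")))
float sum8(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);    // VEX-encoded vhaddps under target("avx")
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);  // compiler inserts vzeroupper before ret
}
```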


    "Some Intel processors" being all except Xeon Phi.

    Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.

    On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for the two different styles of SSE/AVX penalty.

    Related Q&As discuss the effects of asm vzeroupper (in the compiler-generated machine code).


    Intrinsics in C or C++ source

    You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties. (Especially when considering inlining.) All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of where to use vzeroupper after that.

    The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256i/d args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state because a full YMM register is in-use as part of the calling convention.

    x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish from "vectorcall") passes vectors by value in memory via hidden pointer. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) IDK how compilers compiling Windows non-vectorcall calls handle this, whether they assume the function probably looks at its args or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.

    Some compilers optimize by also omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, compilers apparently shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in clean-upper state, so the caller does still need a vzeroupper after such a function, before calling printf or whatever.
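    A sketch of that rule (function names hypothetical): consume() takes a __m256i by value, so both sides of that call are necessarily AVX-aware and in dirty-upper state around it; the caller can't assume the uppers come back clean.

```cpp
#include <immintrin.h>
#include <cstdio>

// Hypothetical callee: takes a __m256i by value, so it is entered with a
// dirty upper and may also return with one.
__attribute__((target("avx2")))
long long consume(__m256i v)
{
    __m128i lo = _mm256_castsi256_si128(v);
    __m128i hi = _mm256_extracti128_si256(v, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    return _mm_cvtsi128_si64(s) + _mm_extract_epi64(s, 1);
    // a compiler may omit vzeroupper here: the caller is clearly AVX-aware
}

__attribute__((target("avx2")))
long long caller(const long long *p)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    long long r = consume(v);  // may return with uppers still dirty
    // so the compiler inserts vzeroupper on this side, before calling a
    // possibly-legacy-SSE function like printf:
    std::printf("%lld\n", r);
    return r;
}
```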


    Compilers have options to control vzeroupper usage

    For example, GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper to do the behaviour described above, using when it might be needed.)

    GCC's -mno-vzeroupper is implied by -march=knl (Knight's Landing), because vzeroupper is not needed and very slow on Xeon Phi CPUs (thus should actively be avoided).

    Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.

    For MSVC, you should definitely prefer using /arch:AVX when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 without /arch:AVX. But beware that that option will make even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and will let the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default); /d2vzeroupper- disables it - see What is the /d2vzeroupper MSVC compiler optimization flag doing?


    Corner case where MSVC and GCC/clang can be tricked into executing a legacy-SSE encoding that writes an XMM register with dirty uppers:

    Compiler heuristics may be assuming that a VEX encoding will be available for any instruction in a function that has definitely (unconditionally) already executed AVX instructions. But that's not the case; some, like cvtpi2ps xmm, mm (MMX+SSE) or movdq2q mm, xmm (SSE2), don't have VEX forms. Nor does the instruction behind _mm_sha1rnds4_epu32 - SHA extensions first appeared in the Silvermont family, which didn't support AVX until Gracemont (Alder Lake), so sha1rnds4 was introduced with a 128-bit non-VEX encoding and still hasn't got a VEX encoding.

    #include <immintrin.h>
    
    void bar(char *dst, char *src)
    {
          __m256 vps = _mm256_loadu_ps((float*)src);
          _mm256_storeu_ps((float*)dst, _mm256_sqrt_ps(vps));
    
    #if defined(__SHA__) || defined(_MSC_VER)
            __m128i t1 = _mm_loadu_si128((__m128i*)&src[32]);
                     // possible compiler bug: writing an XMM with a legacy-SSE encoding while an upper might be dirty
            __m128i t2 = _mm_sha1rnds4_epu32(t1,t1, 3);  // only a non-VEX form exists
            t1 = _mm_add_epi8(t1,t2);
            _mm_storeu_si128((__m128i*)&dst[32], t1);
    #endif
    #ifdef __MMX__  // MSVC for some reason dropped MMX support in 64-bit mode; IDK if it defines __MMX__ even in 32-bit but whatever
            __m128 tmpps = _mm_loadu_ps((float*)&src[48]);
            tmpps = _mm_cvtpi32_ps(tmpps, *(__m64*)&src[48]);
            _mm_storeu_ps((float*)&dst[48], tmpps);
    #endif
    
    }
    

    (This is not a sensible way to use SHA or cvtpi2ps, just randomly using vpaddb to force some extra register copying.)

    Godbolt

    # clang -O3 -march=icelake-client
    bar(char*, char*):
            vsqrtps ymm0, ymmword ptr [rsi]
            vmovups ymmword ptr [rdi], ymm0   # first block, AVX1
    
            vmovdqu xmm0, xmmword ptr [rsi + 32]
            vmovdqa xmm1, xmm0
            sha1rnds4       xmm1, xmm0, 3     # non-VEX encoding while uppers still dirty.
            vpaddb  xmm0, xmm1, xmm0
            vmovdqu xmmword ptr [rdi + 32], xmm0
    
            vmovups xmm0, xmmword ptr [rsi + 48]
            movdq2q mm0, xmm0
            cvtpi2ps        xmm0, mm0         # again same thing
            vmovups xmmword ptr [rdi + 48], xmm0
            vzeroupper                        # vzeroupper not done until here, too late for code in this function.
            ret
    

    MSVC and GCC are about the same. (Although GCC optimizes away the use of the MMX register in this case, using vcvtdq2ps / vshufps. That presumably wouldn't always happen.)

    These are compiler bugs that should be fixed in the compiler, although you may be able to work around them with _mm256_zeroupper() in specific cases if necessary.


    Normally compiler heuristics work fine; e.g. the asm block for if(a) _mm256... will end with a vzeroupper if later code in the function might conditionally run legacy SSE encodings of normal instructions like paddb. (This is only possible with MSVC; gcc/clang require functions containing AVX1 / 2 instructions to be compiled with __attribute__((target("avx"))) or "avx2", which lets them use vpaddb for _mm_add_epi8 anywhere in the function. You have to branch / dispatch based on CPU features on a per-function level, which makes sense because normally you'd want to run a whole loop with AVX or not.)
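    A sketch of that per-function dispatch pattern with gcc/clang (all names hypothetical): the AVX2 loop lives in its own function marked with a target attribute, and the whole loop runs with AVX or not, chosen once at the call site.

```cpp
#include <cstddef>
#include <immintrin.h>

// AVX2 version: only this function is allowed to use AVX2 intrinsics,
// via the target attribute, without compiling the whole file with -mavx2.
__attribute__((target("avx2")))
static void add_floats_avx2(float *dst, const float *a, const float *b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
    // compiler emits vzeroupper before returning to the dispatcher
}

static void add_floats_scalar(float *dst, const float *a, const float *b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

// Dispatch once per call based on CPU features, not per element.
void add_floats(float *dst, const float *a, const float *b, std::size_t n)
{
    if (__builtin_cpu_supports("avx2"))
        add_floats_avx2(dst, a, b, n);
    else
        add_floats_scalar(dst, a, b, n);
}
```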