Tags: c++, visual-c++, sse, avx

How does MSVC avoid mixing SSE and AVX?


Despite the infamous penalty for mixing SSE and AVX encodings (see Why is this SSE code 6 times slower without VZEROUPPER on Skylake?), there may be a need to mix 128-bit and 256-bit operations.

The penalty can be avoided either by always using AVX (VEX) encoding, even for 128-bit operations, or by inserting vzeroupper before any SSE-encoded code runs.
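As a rough sketch of the vzeroupper option (the function names mixed_width_sum and legacy_sse_sum are hypothetical; the helper merely stands in for code built with legacy SSE encodings), the upper YMM state is cleared with the _mm256_zeroupper() intrinsic before any SSE-encoded code runs:

    #include <immintrin.h>
    #include <cstddef>

    // Stand-in for a routine built with legacy SSE encodings
    // (e.g. a translation unit compiled without /arch:AVX).
    static float legacy_sse_sum(const float* p, std::size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));
        acc = _mm_hadd_ps(acc, acc);
        acc = _mm_hadd_ps(acc, acc);
        float r = _mm_cvtss_f32(acc);
        for (; i < n; ++i)
            r += p[i];
        return r;
    }

    float mixed_width_sum(const float* p, std::size_t n)
    {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8)
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

        // Reduce the 256-bit accumulator to a scalar.
        __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                              _mm256_extractf128_ps(acc, 1));
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        float total = _mm_cvtss_f32(s);

        // Clear the dirty upper YMM state before SSE-encoded code runs,
        // so the transition penalty cannot occur.
        _mm256_zeroupper();

        return total + legacy_sse_sum(p + i, n - i);
    }

The other option, encoding even the 128-bit operations with VEX, is essentially what compiling with /arch:AVX gives you automatically.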

For compiler-generated code, if AVX is enabled, the compiler assumes that AVX is available and uses AVX encoding throughout. For every function that can be called externally, it inserts vzeroupper at the end.
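For instance, assuming /arch:AVX2 and the hypothetical function below, the expectation (per the behaviour just described, not a verified listing) is that the load, add and store all come out VEX-encoded, and that a vzeroupper is emitted before the return, since the function is externally callable and dirties the upper YMM state:

    #include <immintrin.h>

    void scale_and_store(float* dst, const float* src)
    {
        __m256 v = _mm256_loadu_ps(src);              // expected vmovups (VEX)
        _mm256_storeu_ps(dst, _mm256_add_ps(v, v));   // expected vaddps + vmovups, then vzeroupper; ret
    }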

MSVC, however, allows AVX code to be generated without AVX enabled, via direct use of intrinsics (unlike some other compilers, which require an AVX-enabling option before AVX intrinsics can be used).
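For example, something like the following (the function name abs_all is only illustrative) compiles with MSVC at its default /arch setting, while GCC and Clang reject the AVX2 intrinsic unless -mavx2 (or an equivalent target attribute) is supplied:

    #include <immintrin.h>

    __m256i abs_all(__m256i v)
    {
        // AVX2 intrinsic used without /arch:AVX2 -- MSVC accepts this,
        // GCC/Clang do not without -mavx2.
        return _mm256_abs_epi32(v);
    }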

How does it avoid mixing SSE and AVX when both kinds of intrinsics are used in a single function?


Solution

  • The compiler switches to AVX encoding after the first AVX intrinsic. For example, the following function:

    void test1(__m256i* dest, char x, char y)
    {
            __m256i a = _mm256_broadcastw_epi16(_mm_cvtsi32_si128(x)); // movd then vpbroadcastw
            __m256i b = _mm256_broadcastw_epi16(_mm_cvtsi32_si128(y)); // vmovd then vpbroadcastw
            _mm256_store_si256(dest, _mm256_andnot_si256(a, b));
    }
    

    has its first _mm_cvtsi32_si128 encoded as movd and its second as vmovd, and the compiler inserts vzeroupper at the end.

    It uses AVX encoding from the beginning if a parameter is passed in an AVX (YMM) register, which happens with the __vectorcall calling convention. Likewise, if an __m256i value is returned, vzeroupper is not inserted at the end.
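    As a minimal sketch of that case (the function name test2 is mine, and the commented encodings are what the rule above implies rather than a verified listing): b arrives in a YMM register and an __m256i is returned, so VEX encoding is expected from the very first instruction and no vzeroupper at the end:

    __m256i __vectorcall test2(__m128i a, __m256i b)
    {
        __m256i widened = _mm256_cvtepi16_epi32(a); // expected vpmovsxwd (VEX) from the start
        return _mm256_add_epi32(widened, b);        // expected vpaddd; no trailing vzeroupper
    }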

    This does not apply to unoptimized compilation. With /Od, or with no /O... option at all, the compiler simply uses the minimum encoding level for each instruction, and it also does not insert vzeroupper at the end.

    Godbolt's compiler explorer demo.


    Unfortunately, this does not always work

    In this issue it was discussed that, in some situations, MSVC still emits non-VEX-encoded SSE instructions in AVX code. The following:

    #include <intrin.h>
    
    static __m128i __vectorcall f1(__m128i a, __m128i b)
    {
        return  _mm_add_epi32(a, b);
    }
    
    __m256i __vectorcall f2(__m128i a, __m128i b)
    {
        // Non-VEX + VEX: the paddd produced for f1 ends up next to VEX-encoded AVX instructions
        return  _mm256_cvtepi16_epi32(f1(a, b));
    }
    
    __m256i __vectorcall WholeFunction(__m128i a, __m128i b) {
        __m256i c = _mm256_xor_si256(_mm256_castsi128_si256(a), _mm256_castsi128_si256(b));
        return _mm256_xor_si256(f2(a, b), c);
    }
    

    makes the compiler mix the non-VEX paddd with the VEX-encoded vmovups.

    Godbolt's compiler explorer demo.

    I've created Developer Community issue 10618264.