Tags: visual-c++, compiler-optimization, simd, intrinsics, avx

How is the /arch parameter used when compiling code with Visual Studio?


My goal is to develop code that compiles using SIMD instructions when they are available and doesn't when they are not. More specifically, in my C code I am making explicit SIMD calls and checking at runtime whether these calls are valid based on processor info I am pulling.
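For reference, the kind of check I mean looks roughly like this (just a sketch: the helper name is mine, using MSVC's __cpuid/__cpuidex and _xgetbv intrinsics):

    #include <intrin.h>      // __cpuid, __cpuidex
    #include <immintrin.h>   // _xgetbv
    #include <stdbool.h>

    // Returns true only if both the CPU and the OS support AVX2.
    static bool cpu_supports_avx2(void)
    {
        int info[4];

        __cpuid(info, 0);
        if (info[0] < 7)                          // CPUID leaf 7 not implemented
            return false;

        __cpuid(info, 1);
        if ((info[2] & (1 << 27)) == 0 ||         // OSXSAVE: OS enabled XGETBV
            (info[2] & (1 << 28)) == 0)           // AVX supported by the CPU
            return false;

        if ((_xgetbv(0) & 0x6) != 0x6)            // OS saves XMM (bit 1) and YMM (bit 2) state
            return false;

        __cpuidex(info, 7, 0);
        return (info[1] & (1 << 5)) != 0;         // AVX2 feature bit in EBX
    }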

I had a bunch of questions but after enough typing SO pointed me to: Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015

The only remaining question is: how does the /arch flag affect explicit SIMD code? Does it still work even if /arch is not set? For example, can I write AVX2 calls without /arch:AVX2?


Solution

  • There are a few pieces to the answer here.

    First, classic Intel 32-bit x86 code used the x87 instruction set for floating-point, and the compiler generated x87 code for the float and double types. For a long time this was the default behavior of the Visual C++ compiler when building for 32-bit. You can force its use for 32-bit code with /arch:IA32; this switch is not valid for 64-bit.

    For AMD64 64-bit code (which has also been adopted by Intel and is known generically as x64), the x87 instruction set was deprecated along with 3DNow! and Intel MMX instructions when running in 64-bit mode. All float and double code is instead generated using SSE/SSE2, although not necessarily making use of the full 4-float or 2-double width of the XMM registers. Instead the compiler will usually generate scalar versions of the SSE/SSE2 instructions that use just the lowest element of an XMM register; in fact the x64 __fastcall calling convention and the .NET marshalling rules for 64-bit pass float and double values in XMM registers as a result. This is the default behavior of the Visual C++ compiler when building for 64-bit. You can get the same codegen for 32-bit via the /arch:SSE or /arch:SSE2 switches; those switches aren't valid for x64 because SSE/SSE2 support is already mandatory there.

    Starting with Visual C++ 2015, /arch:SSE2 is the default for 32-bit code gen and is implicitly required for all 64-bit code gen.
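    As a side note, you can see at build time which /arch setting is in effect through the compiler's predefined macros (the same ones discussed in the question linked above); a minimal sketch:

        // Which /arch setting is in effect, via MSVC's predefined macros.
        #if defined(__AVX2__)
            #pragma message("compiling with /arch:AVX2")
        #elif defined(__AVX__)
            #pragma message("compiling with /arch:AVX")
        #elif defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
            #pragma message("SSE/SSE2 code generation (the default)")
        #elif defined(_M_IX86_FP) && _M_IX86_FP == 1
            #pragma message("compiling with /arch:SSE")
        #else
            #pragma message("x87 code generation (/arch:IA32)")
        #endif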

    This brings us to /arch:AVX. For both 32-bit and 64-bit codegen, this lets the compiler use the VEX prefix to encode the SSE/SSE2 instructions (whether generated by the compiler for float/double math as described above or written explicitly with compiler intrinsics). This encoding uses three operands (dest, src1, src2) instead of the traditional two-operand form (dest/src1, src2) of Intel code. The net result is that all SSE/SSE2 code-gen makes more efficient use of the available registers. This is really the bulk of what using /arch:AVX gets you. To the original question, explicit intrinsics compile with or without the switch; for example:
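    The function below is only an illustration (the name and signature are mine): it builds fine without /arch:AVX, but it will fault at runtime if the CPU or OS lacks AVX support and it is called unguarded.

        #include <immintrin.h>

        // Compiles with or without /arch:AVX, but raises an illegal-instruction
        // exception at runtime if the CPU (or OS) lacks AVX support.
        void add8(float* dst, const float* a, const float* b)
        {
            __m256 va = _mm256_loadu_ps(a);
            __m256 vb = _mm256_loadu_ps(b);
            _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));
        }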

    There are other aspects of the compiler that also make use of the /arch switch settings, such as the optimized memcpy and the instruction set available to the auto-vectorizer in /O2 and /Ox builds. The compiler also assumes that if you use /arch:AVX it is free to use SSE3, SSSE3, SSE4.1, SSE4.2, or AVX instructions as well as SSE/SSE2, as in the loop shown below.
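    For example, a plain loop like the following (an illustrative sketch, not tied to any particular codebase) is a candidate for the auto-vectorizer under /O2, and the /arch setting decides which instruction set it may be widened with:

        // No intrinsics here: under /O2 the auto-vectorizer decides how to widen
        // this loop, and the /arch setting decides which instructions it may use.
        void scale(float* data, int n, float factor)
        {
            for (int i = 0; i < n; ++i)
                data[i] *= factor;
        }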

    With /arch:AVX2 you get the same VEX-prefix behavior and instruction sets, plus the compiler may choose to optimize the code using fused multiply-add (FMA3) instructions, which every AVX2-capable processor also supports. The auto-vectorizer can also use AVX2 instructions when this switch is active.

    TL;DR: If you use the compiler intrinsics directly, you are assuming responsibility for making sure they won't crash at runtime with an invalid-instruction exception. The /arch switch just lets you tell the compiler it is free to use the more advanced instruction sets and encodings everywhere. In practice that means guarding the intrinsic paths yourself, as in the dispatch sketch below.
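    Here is one way to do that (a sketch; all of the function names are placeholders, and cpu_supports_avx2 stands in for whatever CPUID check you already do):

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdbool.h>

        bool cpu_supports_avx2(void);   // e.g. the CPUID check from the question

        static void add_scalar(int* dst, const int* a, const int* b, size_t n)
        {
            for (size_t i = 0; i < n; ++i)
                dst[i] = a[i] + b[i];
        }

        static void add_avx2(int* dst, const int* a, const int* b, size_t n)
        {
            size_t i = 0;
            for (; i + 8 <= n; i += 8)   // 8 x 32-bit ints per 256-bit register
            {
                __m256i va = _mm256_loadu_si256((const __m256i*)(a + i));
                __m256i vb = _mm256_loadu_si256((const __m256i*)(b + i));
                _mm256_storeu_si256((__m256i*)(dst + i), _mm256_add_epi32(va, vb));
            }
            for (; i < n; ++i)           // remainder
                dst[i] = a[i] + b[i];
        }

        // Dispatch once so the AVX2 path only runs on hardware that reports it.
        void add_arrays(int* dst, const int* a, const int* b, size_t n)
        {
            if (cpu_supports_avx2())
                add_avx2(dst, a, b, n);
            else
                add_scalar(dst, a, b, n);
        }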

    See this blog series for more details: DirectXMath: SSE, SSE2, and ARM-NEON; SSE3 and SSSE3; SSE4.1 and SSE4.2; AVX; F16C and FMA; AVX2; and ARM64.