My goal is to develop code that compiles to use SIMD instructions when they are available and doesn't use them when they are not. More specifically, in my C code I am making explicit SIMD calls and checking whether those calls are valid based on processor info I am pulling.
I had a bunch of questions, but after enough typing SO pointed me to: Detecting SIMD instruction sets to be used with C++ Macros in Visual Studio 2015
The only remaining question is: how does the `/arch` flag impact explicit SIMD code? Does this still work even if `/arch` is not set? For example, can I write AVX2 calls without building with `/arch:AVX2`?
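For reference, the runtime check I'm doing looks something like this: a minimal sketch using MSVC's `__cpuid`/`__cpuidex`/`_xgetbv` intrinsics (the function name and structure here are my own, just to illustrate):

```c
#include <intrin.h>   // MSVC-specific: __cpuid, __cpuidex, _xgetbv
#include <stdbool.h>

// Returns true if both the CPU and the OS support AVX2.
static bool supports_avx2(void)
{
    int info[4];

    __cpuid(info, 0);
    if (info[0] < 7)                    // CPUID leaf 7 not available
        return false;

    __cpuid(info, 1);
    if ((info[2] & (1 << 27)) == 0)     // OSXSAVE: OS uses XSAVE/XRSTOR
        return false;
    if ((info[2] & (1 << 28)) == 0)     // AVX supported by the CPU
        return false;

    // The OS must have enabled saving the XMM and YMM state (XCR0 bits 1 and 2).
    if ((_xgetbv(0) & 0x6) != 0x6)
        return false;

    __cpuidex(info, 7, 0);
    return (info[1] & (1 << 5)) != 0;   // AVX2 feature bit in EBX
}
```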
There are a few pieces to the answer here.
First, classic Intel 32-bit x86 code used the x87 instruction set for floating-point, and the compiler would generate x87 code for the `float` and `double` types. For a long time this was the default behavior for the Visual C++ compiler when building for 32-bit. You can force its use for 32-bit code with `/arch:IA32` (this switch is not valid for 64-bit).
For AMD64 64-bit code (which has also been adopted by Intel for 64-bit and is known generically as x64), the x87 instruction set was deprecated, along with 3DNow! and Intel MMX, when running in 64-bit mode. All `float` and `double` code is instead generated using SSE/SSE2, although not necessarily using the full 4-float or 2-double width of the XMM registers. Instead, the compiler will usually generate scalar versions of the SSE/SSE2 instructions that use just the lowest element of an XMM register; in fact, the `__fastcall` calling convention and the .NET marshalling rules for 64-bit only deal with that low element as a result. This is the default behavior for the Visual C++ compiler when building for 64-bit. You can get the same code-gen for 32-bit via the `/arch:SSE` or `/arch:SSE2` switches; those switches aren't valid for x64 because SSE/SSE2 support is already required there.
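As a concrete illustration (a sketch; the exact instructions emitted depend on compiler version and optimization settings), a plain scalar operation like this lands on SSE2 under the default x64 code-gen:

```c
double add(double a, double b)
{
    // Typically compiles to a scalar SSE2 instruction that touches only the
    // low element of the XMM registers, e.g.: addsd xmm0, xmm1
    return a + b;
}
```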
Starting with Visual C++ 2015, `/arch:SSE2` is the default for 32-bit code-gen and is implicitly required for all 64-bit code-gen.
This brings us to `/arch:AVX`. For both 32-bit and 64-bit code-gen, this lets the compiler use the VEX prefix to encode the SSE/SSE2 instructions (whether generated by the scalar float/double code-gen I talked about above or via explicit use of compiler intrinsics). This encoding uses three operands (dest, src1, src2) instead of the traditional two operands (dest/src1, src2) for Intel code. The net result is that all SSE/SSE2 code-gen makes more efficient use of the available registers. This is really the bulk of what using `/arch:AVX` gets you.
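To make that concrete, here is a small sketch showing how the same intrinsic is encoded differently depending on the switch (the instructions in the comments are typical output, not guaranteed):

```c
#include <immintrin.h>

__m128 scale(__m128 v, __m128 s)
{
    // Without /arch:AVX this uses the legacy two-operand encoding:
    //   mulps xmm0, xmm1
    // With /arch:AVX the same intrinsic is VEX-encoded with three operands:
    //   vmulps xmm0, xmm0, xmm1
    return _mm_mul_ps(v, s);
}
```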
There are other aspects of the compiler that also make use of the `/arch` switch setting, such as the optimized `memcpy` and the instruction set that is available for use by the auto-vectorizer in `/O2` and `/Ox` builds. The compiler also assumes that if you use `/arch:AVX` it is free to use SSE3, SSSE3, SSE4.1, SSE4.2, or AVX instructions as well as SSE/SSE2.
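For example, a simple loop like the following is a candidate for the auto-vectorizer (a sketch; whether and how it vectorizes depends on the compiler version, the `/arch` setting, and aliasing analysis):

```c
#include <stddef.h>

void add_arrays(float* a, const float* b, size_t n)
{
    // In /O2 builds the auto-vectorizer may turn this into SSE2 code by
    // default, or into VEX-encoded (and possibly wider) code under /arch:AVX.
    for (size_t i = 0; i < n; ++i)
        a[i] += b[i];
}
```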
With `/arch:AVX2` you get the same VEX-prefix behavior and instruction sets, plus the compiler may choose to optimize the code to use fused multiply-add (FMA3) instructions, which it can assume are supported on any AVX2-capable processor. The auto-vectorizer can also use AVX2 instructions when this switch is active.
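For instance (a sketch; whether the contraction actually happens also depends on the floating-point model selected with `/fp`):

```c
float muladd(float a, float b, float c)
{
    // Under /arch:AVX2 the compiler may fuse this multiply and add into a
    // single FMA3 instruction such as vfmadd213ss.
    return a * b + c;
}
```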
TL;DR: If you use the compiler intrinsics, you are assuming responsibility for making sure they won't crash at runtime with an invalid-instruction exception. The `/arch` switch just lets you tell the compiler to use the more advanced instruction sets and encodings everywhere.
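Putting it together, a common pattern is to isolate the advanced code path and select it at runtime, reusing a check like the `supports_avx2()` sketch in the question above (`transform_avx2` and `transform_sse2` are hypothetical names):

```c
#include <stddef.h>

// Built in a separate translation unit compiled with /arch:AVX2.
extern void transform_avx2(float* data, size_t n);
// Baseline path, built with the default (SSE2) code-gen.
extern void transform_sse2(float* data, size_t n);

void transform(float* data, size_t n)
{
    if (supports_avx2())
        transform_avx2(data, n);
    else
        transform_sse2(data, n);
}
```

Keeping the AVX2 path in its own translation unit matters because the compiler is free to emit AVX2 instructions anywhere in a file built with `/arch:AVX2`, not just inside your intrinsic calls.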
See this blog series for more details: DirectXMath: SSE, SSE2, and ARM-NEON; SSE3 and SSSE3; SSE4.1 and SSE4.2; AVX; F16C and FMA; AVX2; and ARM64.