Xcode Apple Clang enable avx512

In Xcode (Version 10.1 (10B61)), I used the macros below to detect AVX512 support.

#ifdef __SSE4_1__
#error "sse4_1"
#endif

#ifdef __AVX__
#error "avx"
#endif

#ifdef __AVX2__
#error "avx2"
#endif

#ifdef __AVX512__
#error "avx512"
#endif

With the default Build Settings, SSE4_1 is active, but AVX and AVX2 are not. When I add -mavx in Build Settings --> Apple Clang - Custom Compiler Flags --> Other C Flags, AVX is enabled, and adding -mavx2 as well enables both AVX and AVX2, but -mavx512 fails with: Unknown argument: '-mavx512'. How do you enable AVX512 and detect it? It seems there are several macros for detecting AVX512:

#define __AVX512BW__ 1  
#define __AVX512CD__ 1  
#define __AVX512DQ__ 1  
#define __AVX512F__ 1  
#define __AVX512VL__ 1

What are the differences between them?

Solution

AVX512 isn't a single extension, and doesn't have a specific-enough meaning in this context to be useful. Compilers only deal with specific CPU features, like AVX512F, AVX512DQ, AVX512CD, etc.

All CPUs that support any AVX512 extensions must support AVX512F, the "Foundation". AVX512F is the baseline AVX512 extension that other AVX512 extensions build on.
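As a rough sketch of what that means for the detection code in the question: there is no __AVX512__ macro to test, only one macro per extension. The function and enum names below are hypothetical, chosen for illustration; the macros themselves are the ones listed in the question. Note that this only reports what the compiler was told it may use (via -m or -march options), not what the CPU running the program supports.

```c
#include <stdio.h>

/* Illustrative bit flags, one per AVX-512 subset named in the question.
   These names are made up for this sketch. */
enum {
    HAVE_AVX512F  = 1 << 0,
    HAVE_AVX512CD = 1 << 1,
    HAVE_AVX512VL = 1 << 2,
    HAVE_AVX512DQ = 1 << 3,
    HAVE_AVX512BW = 1 << 4
};

/* Returns the subsets enabled at compile time.
   This is not runtime CPU detection. */
unsigned avx512_compile_time_features(void)
{
    unsigned mask = 0;
#ifdef __AVX512F__
    mask |= HAVE_AVX512F;
#endif
#ifdef __AVX512CD__
    mask |= HAVE_AVX512CD;
#endif
#ifdef __AVX512VL__
    mask |= HAVE_AVX512VL;
#endif
#ifdef __AVX512DQ__
    mask |= HAVE_AVX512DQ;
#endif
#ifdef __AVX512BW__
    mask |= HAVE_AVX512BW;
#endif
    return mask;
}

/* Prints one line per subset, e.g. for a quick check of build settings. */
void print_avx512_features(void)
{
    unsigned m = avx512_compile_time_features();
    printf("AVX512F:  %s\n", (m & HAVE_AVX512F)  ? "yes" : "no");
    printf("AVX512CD: %s\n", (m & HAVE_AVX512CD) ? "yes" : "no");
    printf("AVX512VL: %s\n", (m & HAVE_AVX512VL) ? "yes" : "no");
    printf("AVX512DQ: %s\n", (m & HAVE_AVX512DQ) ? "yes" : "no");
    printf("AVX512BW: %s\n", (m & HAVE_AVX512BW) ? "yes" : "no");
}
```

Calling print_avx512_features() from main() after changing Other C Flags should show which subsets a given option actually turned on. With default settings every line would be expected to say "no".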

In code that wants to use AVX512 intrinsics, you should look at https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 and pick a set of extensions that are available together on one CPU you care about, e.g. F + CD and VL, DQ, BW on currently-available Skylake-X.

Then for example use #if defined(__AVX512BW__) && defined(__AVX512VL__) before code that uses vpermt2w on 256-bit vectors or something. __AVX512(anything)__ implies __AVX512F__; that's the one extension you don't have to check for separately.

But if you only use AVX512F instructions, then yeah, just check for that macro.
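A minimal sketch of that kind of guard, assuming a made-up task (interleaving the low 8 elements of two 16-element uint16_t arrays) and a hypothetical function name. The AVX512BW + AVX512VL path uses _mm256_permutex2var_epi16, which compilers typically turn into vpermt2w or vpermi2w on a 256-bit vector; the #else branch is a plain scalar fallback so the file still builds without those extensions.

```c
#include <stdint.h>

#if defined(__AVX512BW__) && defined(__AVX512VL__)
#include <immintrin.h>
#endif

/* out[2k] = a[k], out[2k+1] = b[k] for k = 0..7.
   out must not overlap a or b. */
void interleave_low_u16(const uint16_t a[16], const uint16_t b[16],
                        uint16_t out[16])
{
#if defined(__AVX512BW__) && defined(__AVX512VL__)
    /* Index bits 3:0 pick the element, bit 4 picks the source:
       0..15 select from the first vector, 16..31 from the second. */
    const __m256i idx = _mm256_setr_epi16(0, 16, 1, 17, 2, 18, 3, 19,
                                          4, 20, 5, 21, 6, 22, 7, 23);
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i r  = _mm256_permutex2var_epi16(va, idx, vb);
    _mm256_storeu_si256((__m256i *)out, r);
#else
    /* Scalar fallback when the extensions are not enabled. */
    for (int k = 0; k < 8; ++k) {
        out[2 * k]     = a[k];
        out[2 * k + 1] = b[k];
    }
#endif
}
```

Both branches compute the same result, so the caller does not need to know which one was compiled in.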

You should pretty much never use -mavx512f directly: use -march=skylake-avx512, -march=knl, -march=znver4, -march=native, or whatever to enable other AVX512 extensions.

Or -march=x86-64-v4 for a generic set of AVX-512 features supported by Skylake-AVX512 and others except Xeon Phi, without implying tuning for a specific microarchitecture. (See https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels). On macOS, you know you're only going to be dealing with Intel CPUs, so tune for them as well as enabling extensions with -march=icelake-client or similar.

The compiler knows which CPUs support which sets of extensions (or, with -march=native, can detect which extensions the machine you're compiling on supports). There are a lot of them, and leaving out important ones like AVX512VL (support for AVX512 instructions on 128-bit and 256-bit vectors) or Xeon Phi's AVX512ER (fast 1/x and 1/sqrt(x) with twice the precision of the normal AVX512 14-bit versions) could hurt performance significantly. AVX512ER is especially important if you do any division or log/exp on Xeon Phi, because full-precision division is very slow on KNL compared to Skylake.

-march=x implies -mtune=x, enabling tuning options relevant for the target as well. KNL is basically Silvermont with AVX512 bolted on, and tuning for it differs significantly from -mtune=skylake-avx512.

These are the same reasons you should generally not use -mfma -mavx2 directly. The difference with AVX512 is that AMD CPUs didn't support it before Zen 4 (-march=znver4), so for a long time there were only 2 main tuning targets (Xeon Phi and mainstream Skylake/CannonLake/Icelake), and they also support different sets of AVX512 extensions. There is unfortunately no -mtune=generic-avx2 tuning setting, but Ryzen supports almost all extensions that Haswell does (and the ones it doesn't, GCC / clang won't use automatically, like transactional memory), so -march=haswell might be reasonable to make code tuned for CPUs with FMA, AVX2, popcnt, etc., without suffering too much on Ryzen.

Also relevant (for GCC, maybe not clang currently. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html):

-mprefer-vector-width=256: auto-vectorize with 256-bit vectors by default, in case most of the time is spent in non-vectorized loops. Using 512-bit vectors reduces the max turbo clock speed by a significant amount on Intel Xeon CPUs (maybe not as much on i9 desktop versions of Skylake-X), so it can be a net slowdown to use 512-bit vectors in small scattered bits of your program. So 256 is the default for tune=skylake-avx512 in GCC, but KNL uses 512.

-mprefer-avx-128: the old version of the -mprefer-vector-width= option, before AVX512 existed.

Using AVX512 mask registers, 32 vector registers, and/or its new instructions, can be a significant win even at the same vector width, so it makes sense to enable AVX512 even if you don't want to use 512-bit vector width. (Although sometimes code using intrinsics or auto-vectorization will compile in a worse way, instead of better, if AVX512 compare-into-register versions of comparison are available at all. But hopefully anti-optimization bugs like that will be sorted out as AVX512 becomes more widely used.)
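To illustrate AVX512 features at 256-bit width, here is a small sketch, assuming a made-up operation (add b[i] to a[i] only where a[i] is positive) and a hypothetical function name. With AVX512VL enabled, the comparison writes a mask register and the add is merge-masked, all on a 256-bit vector; without it, a scalar loop stands in. It tests only __AVX512VL__ because, as noted above, any AVX512 extension macro implies __AVX512F__.

```c
#include <stdint.h>

#ifdef __AVX512VL__
#include <immintrin.h>
#endif

/* out[i] = (a[i] > 0) ? a[i] + b[i] : a[i], for 8 int32_t elements. */
void add_where_positive_8(const int32_t a[8], const int32_t b[8],
                          int32_t out[8])
{
#ifdef __AVX512VL__
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Compare into a mask register: one bit per 32-bit element. */
    __mmask8 k = _mm256_cmpgt_epi32_mask(va, _mm256_setzero_si256());
    /* Merge-masking: elements with a clear mask bit keep the value from va. */
    __m256i r = _mm256_mask_add_epi32(va, k, va, vb);
    _mm256_storeu_si256((__m256i *)out, r);
#else
    /* Scalar fallback when AVX512VL is not enabled. */
    for (int i = 0; i < 8; ++i)
        out[i] = (a[i] > 0) ? a[i] + b[i] : a[i];
#endif
}
```

Whether the masked version is actually faster than an AVX2 compare-and-blend depends on the CPU and the surrounding code, so it is worth measuring rather than assuming.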