
How can I limit autovectorization level in GCC?


In other words, is it possible to cap the instructions used by autovectorization (enabled with -ffast-math -ftree-vectorize) at something like AVX, while still using AVX512 through explicit intrinsic calls?

At the moment,

  • without -mavx512f, GCC fails saying it cannot compile my program without avx-512f support. Fair enough.
  • with -mavx512f, GCC starts to use it everywhere.

I've not found any options to let GCC use explicit AVX512 intrinsics while limiting itself to something else for auto-vectorization.


Edit: Just to give a bit more context… I have skylake-avx512 Xeon Gold nodes (2 FMA units) and a domain-specific program.

When I compile with -Ofast -march=skylake-avx512 -mtune=skylake-avx512 and run on one core, I get 30% more performance than -march=haswell ….

When I increase the number of cores to all 24, -march=haswell … is twice as fast as -march=skylake-avx512 …!

The reason is the infamous AVX-512 core frequency throttling…

But my domain-specific software already includes hand-vectorized parts. I do get a performance win with -fno-tree-vectorize -march=skylake-avx512 … (but not enough to beat -march=haswell … with all 24 cores and autovec), so autovectorisation is important.

Finally, if I use AVX2-optimized hand-vectorized kernels with -march=skylake-avx512 …, I also get crappy performance, so I suppose that the expensive part inducing the throttling is indeed the auto-vectorization; hence my original question.


Solution

  • You can use the target attribute to enable instructions on a per-function basis, allowing you to call intrinsics which would otherwise not be allowed.

    I'm guessing you want to switch between implementations of certain functions based on the CPU's capabilities as determined at runtime... If so, you may want to take a look at the target_clones attribute as well.
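
    A minimal sketch of the target attribute approach, assuming GCC on x86-64. The function names (add_avx512, add_portable, add_arrays) are made up for illustration and are not from the question. Only add_avx512 is allowed to use AVX-512F; everything else is compiled for whatever -march is given on the command line, so the auto-vectorizer stays within that limit outside the attributed function.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical hand-vectorized kernel. The target attribute enables
   AVX-512F for this function only, so the intrinsics compile even when
   the file is built without -mavx512f. */
__attribute__((target("avx512f")))
static void add_avx512(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    /* Main loop: 16 floats per 512-bit vector. */
    for (; i + 16 <= n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
    }
    /* Scalar tail for the remaining elements. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Plain C version: built with the command-line ISA (e.g. -march=haswell),
   so any auto-vectorization here is limited to that ISA. */
static void add_portable(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}

/* Runtime dispatch: take the AVX-512 path only if the CPU reports support. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    if (__builtin_cpu_supports("avx512f"))
        add_avx512(dst, a, b, n);
    else
        add_portable(dst, a, b, n);
}
```

    Built with something like gcc -O3 -march=haswell file.c, the intrinsics in add_avx512 compile, while the rest of the translation unit is not given AVX-512. Note that the attribute applies to the whole function body, so the compiler is free to use AVX-512 for any other code placed inside add_avx512; it is probably best to keep such functions limited to the intrinsic kernels themselves.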