linux performance gcc x86-64 avx512

Enabling AVX512 support on compilation significantly decreases performance


I've got a C/C++ project that uses a static library. The library is built for the 'skylake' architecture. The project is a data processing module, i.e. it performs many arithmetic operations, memory copying, searching, comparing, etc.

The CPU is a Xeon Gold 6130T, which supports AVX-512. I compiled my project with both -march=skylake and -march=skylake-avx512, linking each build against the library.

With -march=skylake-avx512, the project is significantly slower (by 30% on average) than the build with -march=skylake.

How can this be explained? What could be the reason?

Info:

  • Linux 3.10
  • gcc 9.2
  • Intel Xeon Gold 6130T

Solution

  • project performance is significantly decreased (by 30% on average)

    In code that cannot be easily vectorized, sporadic AVX instructions here and there downclock your CPU but do not provide any benefit. You may want to turn off AVX instructions completely in such scenarios.
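One way to check whether wide instructions are involved is to keep the AVX-512 target but ask GCC to prefer narrower vectors, or to disable AVX-512 code generation, and then look for 512-bit zmm registers in the generated assembly. The sketch below uses a hypothetical kernel (scale_add and the file name kernel.c are illustrative, not taken from the project). The flags shown in the comment, -mprefer-vector-width=256 and -mno-avx512f, are available in GCC 8 and later. Whether either build recovers the lost performance is something to measure, not assume.

```c
#include <stddef.h>

/*
 * Hypothetical kernel, simple enough for GCC to consider auto-vectorizing
 * at -O3.
 *
 * Compare the assembly produced under different flags, for example:
 *   gcc -O3 -march=skylake-avx512 -S kernel.c -o default.s
 *   gcc -O3 -march=skylake-avx512 -mprefer-vector-width=256 -S kernel.c -o narrow.s
 *   gcc -O3 -march=skylake-avx512 -mno-avx512f -S kernel.c -o noavx512.s
 *   grep -c zmm default.s narrow.s noavx512.s   # lines mentioning 512-bit registers
 */
void scale_add(float *dst, const float *src, float k, size_t n)
{
    /* dst[i] = dst[i] + k * src[i] for every element */
    for (size_t i = 0; i < n; ++i)
        dst[i] += k * src[i];
}
```

The same comparison can be applied to the project's own hot translation units, followed by a timing run of each build on the target machine.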

    See the Wikipedia article Advanced Vector Extensions, section Downclocking:

    Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:

    • L0 (100%): The normal turbo boost limit.
    • L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
    • L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.

    The frequency transition can be soft or hard. A hard transition means the frequency is reduced as soon as such an instruction is spotted; a soft transition means the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.

    Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL allows AVX-512 instructions to operate on 128-bit and 256-bit operands, making it a sensible default for mixed loads.
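The trade-off in a mixed workload can be put into rough numbers with an Amdahl-style estimate. The sketch below assumes a deliberately simplified model that is not part of the quoted article: a fraction of the scalar run time vectorizes with a given speedup, and the core stays at the reduced clock (for instance the approximate 0.60 of level L2) for the whole run. The function name relative_time and the sample inputs are illustrative only.

```c
/*
 * Estimated run time relative to a scalar build running at the normal
 * turbo limit (1.0 means equal; above 1.0 means the vectorized build is
 * slower).
 *
 * Simplifying assumption: the whole run executes at the reduced clock.
 *   vector_fraction  share of the scalar run time that vectorizes (0..1)
 *   vector_speedup   speedup of that share from wide instructions
 *   clock_factor     reduced clock relative to L0, e.g. 0.85 or 0.60
 */
double relative_time(double vector_fraction, double vector_speedup,
                     double clock_factor)
{
    /* Remaining work after vectorization, measured at the original clock. */
    double work = (1.0 - vector_fraction) + vector_fraction / vector_speedup;

    /* A slower clock stretches all of that work, scalar parts included. */
    return work / clock_factor;
}
```

Under this model, relative_time(0.10, 8.0, 0.60) is about 1.52, so a workload that is only 10% vectorizable ends up roughly 50% slower, while relative_time(0.90, 8.0, 0.60) is about 0.35, so a mostly vectorizable workload still wins clearly. Real behavior depends on how long the core actually stays at the reduced clock, so these figures are only an illustration.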

    Also, see