performance · x86-64 · simd · avx512

Does an AVX-512 mask register reduce execution time?


When doing an AVX-512 operation (using intrinsics) with a mask register, does the content of the mask change anything about the computation's performance (latency, throughput, occupation of ports, ...)?

For example: if I perform a _mm512_mask_fmadd_round_ps and my mask register has only 1 bit set, is that any different from having all bits set?

Supposing I'm not using the results of the computation for anything but one of the lanes, should I always mask when possible, or is it guaranteed (by spec, or by measurements on actual CPUs, though that may be likely to change...) to have no visible effect? Maybe, with intrinsics being optimized before they're lowered to asm, the compiler can see some form of ILP? A minimal sketch of the two variants I mean is below.
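
For reference, here are the two variants sketched out (the function names and the single-bit mask value are just illustrative):

```c
#include <immintrin.h>

// Unmasked: full 16-lane FMA with rounding override, lane 0 extracted.
float fma_full(__m512 a, __m512 b, __m512 c) {
    __m512 r = _mm512_fmadd_round_ps(a, b, c,
                   _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm512_cvtss_f32(r);           // only lane 0 is used
}

// Merge-masked: only bit 0 of the mask is set; other lanes keep a's values.
float fma_masked(__m512 a, __m512 b, __m512 c) {
    __mmask16 k = 0x0001;                 // single lane enabled
    __m512 r = _mm512_mask_fmadd_round_ps(a, k, b, c,
                   _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm512_cvtss_f32(r);
}
```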


Solution

  • No. Even for the rare instructions like vsqrtps, where the execution units aren't full width, the CPU does the full computation regardless of the mask and just merges at the end. See performance numbers on https://uops.info/ (and https://agner.org/optimize/ for more info on understanding uops, ports, and latency).

    Merge-masking does add a dependency, though: I think the merge target even needs to be ready before the uop can dispatch, even though in theory it's only needed after the ALU latency. Zero-masking (maskz instead of mask) doesn't have this issue because the destination is write-only, not merged into.
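
    To make that difference concrete, here's a minimal sketch (illustrative function names; using vaddps because its merge source is a separate operand, unlike FMA where all three vector operands are inputs anyway):

    ```c
    #include <immintrin.h>

    // Merge-masking: src supplies the lanes where k is 0, so the add uop
    // gains an extra input dependency on src.
    __m512 add_merge(__m512 src, __mmask16 k, __m512 a, __m512 b) {
        return _mm512_mask_add_ps(src, k, a, b);
    }

    // Zero-masking: lanes where k is 0 become 0.0f; the destination is
    // write-only, so there's no extra dependency.
    __m512 add_zero(__mmask16 k, __m512 a, __m512 b) {
        return _mm512_maskz_add_ps(k, a, b);
    }
    ```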

    No, masking doesn't make anything faster; it would just cost you extra instructions to set up the mask register. It probably doesn't even save power (which could indirectly help performance by allowing higher turbo for longer).
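
    For example, even a trivial constant mask has to be materialized first, typically costing a mov plus a kmov (illustrative sketch):

    ```c
    #include <immintrin.h>

    __m512 masked_fma(__m512 a, __m512 b, __m512 c) {
        __mmask16 k = _cvtu32_mask16(1);          // e.g. mov eax,1 ; kmovw k1,eax
        return _mm512_mask_fmadd_ps(a, k, b, c);  // vs. a plain vfmadd...ps
    }
    ```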

    CPUs already run most SIMD instructions with fixed latency for the 512-, 256-, and 128-bit versions, using wide execution units, so skipping some work would just mean some elements of the execution unit go idle. The whole point of short-vector CPU-style SIMD is to stuff more work into fewer instructions, because finding dependencies between instructions (or uops) and scheduling them out of order onto execution ports is complex and power-expensive to scale up.

    Breaking vectors up and looking for idle elements of the SIMD execution units, trying to send other work to those parts of an execution unit, would defeat that purpose: it would require scheduling each element separately.

    (GPU-style SIMD has a simple pipeline for each execution unit so it does basically work that way. CPUs care about latency, not just throughput, so they have one wide out-of-order exec pipeline per core, not huge numbers of pipelines.)


    If the only element you care about is known to be in the low 128 or 256 bits, you can use the _mm_fmadd_ps or _mm256_fmadd_ps version. (The rounding-override versions only exist for 512-bit vector width, though, so if you need the _round override, it has to be 512.)
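
    For example, if the useful data is already in the low 128 bits and no rounding override is needed, something like this avoids 512-bit operations entirely (illustrative sketch):

    ```c
    #include <immintrin.h>

    // 128-bit FMA on the low vector; requires FMA3 (any AVX-512 CPU has it).
    float fma_low(__m128 a, __m128 b, __m128 c) {
        __m128 r = _mm_fmadd_ps(a, b, c);   // or _mm_fmadd_ss for just lane 0
        return _mm_cvtss_f32(r);
    }
    ```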

    Clang has a pretty clever shuffle optimizer that can sometimes see through compile-time-constant shuffles, notice that only one element is extracted afterwards, and optimize accordingly.
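
    For instance, in a case like this (illustrative sketch), clang may be able to turn the constant lane shuffle plus extract into a single cheaper extract of the one element it actually needs:

    ```c
    #include <immintrin.h>

    float one_element(__m512 v) {
        // Constant shuffle: result's low 128-bit lane = v's 128-bit lane 3.
        __m512 s = _mm512_shuffle_f32x4(v, v, _MM_SHUFFLE(0, 0, 0, 3));
        return _mm512_cvtss_f32(s);   // only element 12 of v survives
    }
    ```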