
How exactly are AVX-512 instructions executed on the ALU?


I have trouble understanding how 512-bit registers can be processed by an ALU in a single clock cycle. Are there multiple ALUs that divide the data, or is there a specialised ALU that can work with this?


Solution

  • Yes, a 512-bit SIMD ALU replicates, for example, 16x 32-bit FMA units; that's the whole idea of CPU SIMD: provide wide execution units so more work can go through the pipeline in the same number of instructions.
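    A rough way to picture the lane structure (this is a conceptual model in Python, not how any real hardware is described; the sequential loop stands in for 16 lane FMAs that actually operate in parallel):

    ```python
    import struct

    def round_to_f32(x):
        """Round a Python float (double) to single precision, like a 32-bit lane."""
        return struct.unpack('f', struct.pack('f', x))[0]

    def fma512_ps(a, b, c):
        """Model a 512-bit packed single-precision FMA as 16 independent
        32-bit lanes, each computing a*b + c like one scalar FMA unit.
        In silicon all 16 lane FMAs run in the same cycle(s); the loop here
        only models the lane structure, not the timing."""
        assert len(a) == len(b) == len(c) == 16
        return [round_to_f32(a[i] * b[i] + c[i]) for i in range(16)]
    ```

    The point is that widening from 128-bit to 512-bit doesn't change what each lane does; it just stamps out more copies of the same 32-bit (or 64-bit) datapath side by side.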

    e.g. note the "256-bit FMA" execution units in Haswell. (See David Kanter's deep-dive, which compares against Sandybridge.) Also note how Haswell widened the load/store path to/from L1d cache from 128-bit to 256-bit. (Sandybridge did address-generation once per 256-bit AVX load or store, but spent 2 cycles in the EU on the data.)

    Multiple microarchitectures have worked by splitting SIMD instructions into two half-width uops, like Intel Pentium-M for SSE and AMD Zen 1 for AVX, with only 64-bit or 128-bit SIMD execution units, respectively. But no existing x86 CPU has supported a SIMD instruction-set more than twice as wide as its vector ALUs. IDK about other ISAs.
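    The half-width split can be sketched like this (a toy model, not real decoder logic; the register names and lane counts are just for illustration):

    ```python
    def execute_vaddps_512(regs, dst, a, b, alu_width_lanes=8):
        """Sketch: execute a 16-lane (512-bit) packed float add on a machine
        whose SIMD ALU is only alu_width_lanes wide, by splitting the
        instruction into multiple uops, one per ALU-width chunk of lanes.
        Returns the number of uops issued (2 for a half-width ALU, 1 for
        a full-width one), mirroring how Zen 1 ran 256-bit AVX as two
        128-bit uops."""
        lanes = 16
        result = [0.0] * lanes
        uops = 0
        for start in range(0, lanes, alu_width_lanes):
            # One uop per chunk: the ALU handles alu_width_lanes lanes at a time,
            # so the halves occupy the execution unit on separate cycles.
            for i in range(start, start + alu_width_lanes):
                result[i] = regs[a][i] + regs[b][i]
            uops += 1
        regs[dst] = result
        return uops
    ```

    The architectural result is identical either way; what changes is uop count, and therefore front-end and scheduler bandwidth per instruction, which is exactly the trade-off those CPUs made to support a wider ISA cheaply.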

    See https://agner.org/optimize/ and https://uops.info/ for details on those.

    And yes, this can take significant die area; that was one of the major arguments against AVX-512, that spending that area on more cores would be better for most programs. (And that it's a "power virus" to quote Linus Torvalds; as a kernel dev he's probably less inclined to see the benefit of wider SIMD, although I think he understands that user-space uses SIMD all over the place even for memcpy.)

    The area cost is also why Intel CPUs often have only a half-width SIMD divide/sqrt unit, so the widest SIMD division the CPU supports has to be split into two passes through it.
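    As a toy cost model (the cycle number below is illustrative, not taken from any real CPU's tables; see Agner Fog or uops.info for real figures), a half-width divide unit simply multiplies the occupancy by the number of passes:

    ```python
    def vdivps_zmm_cost(div_unit_lanes=8, vector_lanes=16, per_pass_cycles=10):
        """Toy model: a full-width packed divide makes
        vector_lanes / div_unit_lanes passes through a divide unit that is
        narrower than the vector registers. Returns (passes, total cycles
        the unit is occupied). All numbers here are hypothetical."""
        passes = vector_lanes // div_unit_lanes
        return passes, passes * per_pass_cycles
    ```

    So a divider half as wide costs half the area but doubles the time a full-width division occupies the unit; since division is rare in most code, that's usually a good trade.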