I am just wondering about the below code:
mov eax, r9d ; eax = j
mul n ; eax = n * j
shl eax, 2 ; eax = 4 * n * j
; now I want to 'broadcast' this to YMM, like so:
; ymm = { eax, eax, eax, eax, eax, eax, eax, eax }
; This requires AVX512, not just AVX2
; vpbroadcastd ymm7, eax
movd xmm7, eax ; therefore I must do this workaround?
vpbroadcastd ymm7, xmm7 ; and finally, the result
Can this somehow be simplified or optimized?
Yes, vmovd + vpbroadcastd is the normal way if you don't have AVX512, for both Intel and AMD CPUs.
I see 2 optimizations:
Replace mul n with imul r9d, n, since you're not using the EDX high half of the multiply result anyway. 2-operand imul r32, r/m32 is a single uop on all modern CPUs; mul r/m32 takes multiple (https://uops.info/, https://agner.org/optimize/). And of course if n is an assemble-time constant, imul eax, r9d, n*4 does the multiply and folds in the shift in one instruction.
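For example, a minimal sketch in the question's syntax (I've kept the result in eax as in the question; the 2-operand imul could equally target r9d if clobbering j is acceptable, and n has to be a register or memory operand for the 2-operand form):

mov  eax, r9d        ; eax = j, as before
imul eax, n          ; eax = n * j (single uop; EDX is untouched)
shl  eax, 2          ; eax = 4 * n * j
; or, if n is an assemble-time constant, fold the shift into the multiply:
imul eax, r9d, n*4   ; eax = (4*n) * j in one instruction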
Use a VEX prefix on movd xmm7, eax, i.e. vmovd xmm7, eax. If any YMM registers have dirty upper halves when legacy-SSE movd writes xmm7, it will trigger an AVX-SSE transition penalty on Haswell and Ice Lake. (Why is this SSE code 6 times slower without VZEROUPPER on Skylake? has the details for both HSW/ICL and the different strategy SKL uses.)
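So the broadcast part of the sequence stays as in the question, just VEX-encoded (sketch):

vmovd        xmm7, eax    ; VEX-encoded: no AVX-SSE transition penalty even with dirty YMM uppers
vpbroadcastd ymm7, xmm7   ; ymm7 = { eax, eax, eax, eax, eax, eax, eax, eax }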
Without AVX512, yes, it takes a uop (like a movd instruction) to transfer data from the GP-integer domain to the SIMD domain, and that uop can't also broadcast. You then need another uop to do a shuffle.
As @chtz points out, if port 5 pressure in the back end on Intel CPUs is the major bottleneck for a loop including this (instead of total front-end uops or latency), you could mov-store (e.g. to the stack) and vpbroadcastd-reload.
Both vmovd xmm, r32 and vpbroadcastd can only run on port 5 on Intel CPUs. But a store is micro-fused p237 + p4, and a broadcast-load (of 32-bit or wider elements) is handled purely in a load port, with no ALU uop needed, so the total cost is still 2 front-end uops on Intel CPUs, at a cost of p237+p4 + p23 instead of 2p5. Store-forwarding latency of ~5 or 6 cycles is actually similar to 1-to-3-cycle vmovd + 3-cycle vpbroadcastd, so this may be worth considering for 32-bit and 64-bit broadcasts from registers, if there isn't much pressure on the load/store ports.
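A sketch of that store/reload alternative, assuming x86-64 System V where the red zone below RSP is usable as scratch space (otherwise reserve a proper stack slot):

mov          [rsp-4], eax   ; integer store: micro-fused p237 + p4
vpbroadcastd ymm7, [rsp-4]  ; 32-bit broadcast-load: handled in a load port (p23), no ALU uop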
(Including maybe SSE3 movddup broadcast-loads into XMM registers, although in-lane shuffles are only 1-cycle latency, so movd + an XMM shuffle is only about 4 cycles of latency on Haswell and later.)
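For a 64-bit broadcast into just an XMM register, the same store/reload trick could use movddup for the reload (again assuming red-zone scratch space; in AVX code you'd use vmovddup to avoid the transition-penalty issue above):

mov     [rsp-8], rax    ; 64-bit integer store
movddup xmm7, [rsp-8]   ; load 64 bits and duplicate into both halves of xmm7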
It's easy to measure the latency of a movd xmm, r / movd r, xmm round trip, but hard to figure out which instruction has which latency. They might just be 1-cycle ALU operations plus a 1-cycle bypass delay on Skylake. Haswell apparently has 1-cycle movd in each direction. https://uops.info/ just measures a not-very-tight upper bound on latency, by putting the instruction in a loop with other instructions to create a loop-carried dependency and assuming those others have 1-cycle latency. https://agner.org/optimize/ makes a guess about how to split up the latency between a pair of instructions. Perhaps one could do better by including store-forwarding for one direction and an ALU transfer for the other, but store-forwarding latency on Sandybridge-family is notoriously variable: it's faster if you don't try to reload right away (e.g. useless stores can speed up the critical path through a store-forwarding bottleneck; see Adding a redundant assignment speeds up code when compiled without optimization). And store-forwarding between an integer store and a vmovd xmm reload can't be assumed to have the same latency as an integer reload.
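For instance, a round-trip dependency chain like this hypothetical micro-benchmark loop (time many iterations with perf or RDTSC) measures the sum of the two latencies per iteration, but can't split it between the two directions:

mov   ecx, 100000000   ; arbitrary iteration count
measure_loop:
vmovd xmm0, eax        ; GP -> SIMD
vmovd eax, xmm0        ; SIMD -> GP: loop-carried dependency through eax and xmm0
dec   ecx
jnz   measure_loop     ; loop overhead runs in parallel, off the critical path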
Skylake's movd xmm<->eax round trip is a total of 4 cycles of latency, up from 2 on Sandybridge / Haswell. That could be 2 and 2 with bypass delays, or 1 and 3, without telling us which direction is slower. Zen's is 6 cycles, so maybe 3 cycles each way.
AVX512F vpbroadcastd ymm, r32 is single-uop (port 5), so it's very good if you have AVX512.
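For completeness, that's the single instruction the question had commented out (the YMM destination form needs AVX512VL as well as AVX512F):

vpbroadcastd ymm7, eax   ; one uop (port 5 on Intel): broadcast a GP register directly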