I am just wondering about the below code:
mov eax, r9d ; eax = j
mul n ; eax = n * j
shl eax, 2 ; eax = 4 * n * j
; now I want to 'broadcast' this to YMM, like so:
; ymm = { eax, eax, eax, eax, eax, eax, eax, eax }
; This requires AVX512, not just AVX2
; vpbroadcastd ymm7, eax
movd xmm7, eax ; therefore I must do this workaround?
vpbroadcastd ymm7, xmm7 ; and finally, the result
Can this somehow be simplified or optimized?
Yes, vmovd + vpbroadcastd is the normal way if you don't have AVX512, for both Intel and AMD CPUs.
I see 2 optimizations:
Replace mul n with imul r9d, n, since you're not using the EDX high half of the multiply result anyway. 2-operand imul r32, r/m32 is a single uop on all modern CPUs; mul r/m32 takes multiple (https://uops.info/, https://agner.org/optimize/). And of course if n is an assemble-time constant, imul eax, r9d, n*4 does the multiply and folds in the shift in one instruction.
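For example, a minimal sketch in the question's syntax (I've kept the result in eax as in the question; the 2-operand imul could equally target r9d if clobbering j is acceptable, and n has to be a register or memory operand for the 2-operand form):

mov  eax, r9d        ; eax = j, as before
imul eax, n          ; eax = n * j (single uop; EDX is untouched)
shl  eax, 2          ; eax = 4 * n * j
; or, if n is an assemble-time constant, fold the shift into the multiply:
imul eax, r9d, n*4   ; eax = (4*n) * j in one instruction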
Use a VEX prefix on movd xmm7, eax, i.e. vmovd xmm7, eax. If any YMM registers have dirty upper halves when legacy-SSE movd writes xmm7, it will trigger an AVX-SSE transition penalty on Haswell and Ice Lake. (Why is this SSE code 6 times slower without VZEROUPPER on Skylake? has the details for both HSW/ICL and the different strategy SKL uses.)
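So the broadcast part of the sequence stays as in the question, just VEX-encoded (sketch):

vmovd        xmm7, eax    ; VEX-encoded: no AVX-SSE transition penalty even with dirty YMM uppers
vpbroadcastd ymm7, xmm7   ; ymm7 = { eax, eax, eax, eax, eax, eax, eax, eax }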
Without AVX512, yes, it takes a uop (like a movd instruction) to transfer data from the GP-integer domain to the SIMD domain, and that uop can't also broadcast. You then need another uop to do a shuffle.
As @chtz points out, if port 5 pressure in the back end on Intel CPUs is the major bottleneck for a loop including this (instead of total front-end uops or latency), you could mov-store (e.g. to the stack) and vpbroadcastd-reload.
Both vmovd xmm, r32 and vpbroadcastd can only run on port 5 on Intel CPUs. But a store is micro-fused p237 + p4, and a broadcast-load (of 32-bit or wider elements) is handled purely in a load port, with no ALU uop needed, so the total cost is still 2 front-end uops on Intel CPUs, at a cost of p237+p4 + p23 instead of 2p5. Store-forwarding latency of ~5 or 6 cycles is actually similar to 1-to-3-cycle vmovd + 3-cycle vpbroadcastd, so this may be worth considering for 32-bit and 64-bit broadcasts from registers, if there isn't much pressure on the load/store ports.
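A sketch of that store/reload alternative, assuming x86-64 System V where the red zone below RSP is usable as scratch space (otherwise reserve a proper stack slot):

mov          [rsp-4], eax   ; integer store: micro-fused p237 + p4
vpbroadcastd ymm7, [rsp-4]  ; 32-bit broadcast-load: handled in a load port (p23), no ALU uop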
(Including maybe SSE3 movddup broadcast-loads into XMM registers, although in-lane shuffles are only 1-cycle latency, so movd + an XMM shuffle is only about 4 cycles of latency on Haswell and later.)
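For a 64-bit broadcast into just an XMM register, the same store/reload trick could use movddup for the reload (again assuming red-zone scratch space; in AVX code you'd use vmovddup to avoid the transition-penalty issue above):

mov     [rsp-8], rax    ; 64-bit integer store
movddup xmm7, [rsp-8]   ; load 64 bits and duplicate into both halves of xmm7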
It's easy to measure the latency of a movd xmm, r / movd r, xmm round trip, but hard to figure out which instruction has which latency. They might just be 1-cycle ALU operations plus a 1-cycle bypass delay on Skylake. Haswell apparently has 1-cycle movd in each direction. https://uops.info/ just measures a not-very-tight upper bound on latency, by putting the instruction in a loop with other instructions to create a loop-carried dependency and assuming those others have 1-cycle latency. https://agner.org/optimize/ makes a guess about how to split up the latency between a pair of instructions. Perhaps one could do better by including store-forwarding for one direction and an ALU transfer for the other, but store-forwarding latency on Sandybridge-family is notoriously variable: it's faster if you don't try to reload right away (e.g. useless stores can speed up the critical path through a store-forwarding bottleneck; see Adding a redundant assignment speeds up code when compiled without optimization). And store-forwarding between an integer store and a vmovd xmm reload can't be assumed to have the same latency as an integer reload.
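For instance, a round-trip dependency chain like this hypothetical micro-benchmark loop (time many iterations with perf or RDTSC) measures the sum of the two latencies per iteration, but can't split it between the two directions:

mov   ecx, 100000000   ; arbitrary iteration count
measure_loop:
vmovd xmm0, eax        ; GP -> SIMD
vmovd eax, xmm0        ; SIMD -> GP: loop-carried dependency through eax and xmm0
dec   ecx
jnz   measure_loop     ; loop overhead runs in parallel, off the critical path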
Skylake's movd xmm<->eax round trip is a total of 4 cycles of latency, up from 2 on Sandybridge / Haswell. That could be 2 and 2 with bypass delays, or 1 and 3, without telling us which direction is slower. Zen's is 6 cycles, so maybe 3 cycles each way.
AVX512F vpbroadcastd ymm, r32 is single-uop (port 5), so it's very good if you have AVX512.
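For completeness, that's the single instruction the question had commented out (the YMM destination form needs AVX512VL as well as AVX512F):

vpbroadcastd ymm7, eax   ; one uop (port 5 on Intel): broadcast a GP register directly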