Consider the following instructions from the LLVM MCA documentation:
vmulps %xmm0, %xmm1, %xmm2
vhaddps %xmm2, %xmm2, %xmm3
vhaddps %xmm3, %xmm3, %xmm4
The docs contain a fascinating discussion of the state transitions of these instructions in a simulated loop of 3 iterations. According to the docs, the timeline is:
[0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
[0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
[0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
[1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
[1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
[1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
[2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
[2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
[2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4
D is instruction dispatch, e refers to an ongoing execution, E to its termination, and R to retiring. Furthermore, = means the instruction has been dispatched but is waiting to be executed, while - means the instruction waits to be retired.
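To make the legend concrete, here is a minimal Python sketch that counts how many cycles a single timeline row spends in each waiting or executing state. The function name summarize_timeline_row and the dictionary keys are hypothetical names chosen for illustration, not part of llvm-mca; the row strings are copied from the timeline above.

```python
def summarize_timeline_row(states: str) -> dict:
    """Count per-state cycles in one timeline row.

    '=' : dispatched, waiting to be executed
    'e' : executing
    '-' : waiting to be retired
    The single-cycle markers 'D', 'E' and 'R' and the '.'/space
    filler characters are ignored by this sketch.
    """
    return {
        "waiting_to_execute": states.count("="),
        "executing": states.count("e"),
        "waiting_to_retire": states.count("-"),
    }


# Rows taken from the timeline above (iteration, instruction index).
rows = {
    "[0,0]": "DeeER",
    "[0,1]": "D==eeeER",
    "[1,0]": ".DeeE-----R",
}

for index, states in rows.items():
    print(index, summarize_timeline_row(states))
```

For row [0,1], the first vhaddps, this reports two cycles spent waiting between dispatch and the start of execution, which is the delay the question below asks about.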
Considering the sequence above, what prevents the first vhaddps instruction from being executed in the second cycle?
I cannot see a dependency between the prior instructions. Furthermore, the docs also highlight that the multiplication uses the JFPM and JFPU1 execution units, but the addition uses other resources, namely JFPA and JFPU0. Shouldn't the addition execute earlier?
vhaddps is 3 uops: 2 shuffles and a vertical add. The shuffle uops can only run on port 5, so there's a resource conflict. This is why vhaddps has a throughput of one per 2 cycles.
(That's on Intel. AMD Zen 2 and later decode it as 4 uops for the XMM version, 3 for the YMM version, strangely. https://uops.info/).
This is why it's so bad for horizontally summing one register; it's only useful, like here, with two different inputs.
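The port-5 argument can be sketched as a small calculation: if two of the three uops can only issue on one port, that port alone limits the instruction to one per 2 cycles. This is a rough illustration, not a pipeline model. The helper name reciprocal_throughput_bound is hypothetical, and the ports listed for the add uop are an assumption made only so the example has a third uop; the two port-5 shuffles are the part taken from the explanation above.

```python
from collections import Counter


def reciprocal_throughput_bound(uops):
    """Lower bound on cycles per instruction from single-port uops.

    `uops` is a list of tuples, one per uop, naming the ports that uop
    may issue on. Only uops pinned to exactly one port are counted;
    uops that can choose between ports are ignored, so this is a
    lower bound and not a full scheduling model.
    """
    pinned = Counter(ports[0] for ports in uops if len(ports) == 1)
    return max(pinned.values(), default=0)


# vhaddps as described above: 2 shuffle uops on port 5 plus 1 add uop.
# Assumption: the ("p0", "p1") entry for the add uop is illustrative.
VHADDPS_UOPS = [("p5",), ("p5",), ("p0", "p1")]

print(reciprocal_throughput_bound(VHADDPS_UOPS))
```

With two uops pinned to port 5, the bound is 2 cycles per instruction regardless of where the add uop runs.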
Also, terminology nitpick: LLVM-MCA doesn't "measure" anything, it simulates. At best it's a loose prediction of what real hardware will do. uiCA tends to come closer because it models the parts of the pipeline that matter for the CPUs it supports (Intel Sandybridge-family up to Ice Lake / Rocket Lake). Andreas Abel published a paper about it. https://www.uops.info/uiCA.html