Tags: assembly, x86, cpu-architecture, llvm-mca

Why does LLVM-MCA measure an execution stall?


Consider the following instructions from the LLVM-MCA documentation:

vmulps      %xmm0, %xmm1, %xmm2
vhaddps     %xmm2, %xmm2, %xmm3
vhaddps     %xmm3, %xmm3, %xmm4
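To make the data flow of these three instructions concrete, here is a minimal Python sketch of what they compute on 4-lane registers. The helper functions and the sample input values are illustrative assumptions, not part of the LLVM-MCA docs; registers are modelled as plain lists of floats.

```python
def vmulps(src1, src2):
    # Lane-wise multiply of two 4-lane registers.
    return [a * b for a, b in zip(src1, src2)]

def vhaddps(src1, src2):
    # Horizontal add: adjacent pairs of src1 fill the low two lanes,
    # adjacent pairs of src2 fill the high two lanes.
    return [src1[0] + src1[1], src1[2] + src1[3],
            src2[0] + src2[1], src2[2] + src2[3]]

# Hypothetical input values, chosen only for illustration.
xmm0 = [1.0, 2.0, 3.0, 4.0]
xmm1 = [5.0, 6.0, 7.0, 8.0]

xmm2 = vmulps(xmm0, xmm1)   # [5.0, 12.0, 21.0, 32.0]
xmm3 = vhaddps(xmm2, xmm2)  # reads xmm2: [17.0, 53.0, 17.0, 53.0]
xmm4 = vhaddps(xmm3, xmm3)  # reads xmm3: [70.0, 70.0, 70.0, 70.0]

print(xmm4)
```

With these inputs every lane of xmm4 ends up holding the dot product of xmm0 and xmm1.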

The docs contain a fascinating discussion of how these instructions move through the pipeline states in a simulated loop of 3 iterations. According to the docs, the timeline is:

[0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
[0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
[0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4
[1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
[1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
[1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4
[2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
[2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
[2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4

D marks instruction dispatch, e a cycle in which the instruction is executing, E the cycle in which execution completes, and R the cycle in which the instruction retires. Furthermore, = means the instruction has been dispatched but is waiting to execute, while - means it has finished executing and is waiting to retire.
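The legend can be applied mechanically. The following Python sketch reads one row of the timeline and counts how long the instruction waited before executing and before retiring. The function name `parse_timeline_row` is made up for this illustration, and the parsing assumes the exact row layout shown above (an `[iteration,index]` prefix, then the timeline starting at cycle 0); it is not an LLVM-MCA API.

```python
def parse_timeline_row(row):
    """Extract per-stage cycle counts from one timeline row (layout assumed as above)."""
    # Drop the "[iteration,index]" prefix; the first non-space character
    # after it is cycle 0 of the timeline.
    _, rest = row.split("]", 1)
    timeline = rest.lstrip()

    dispatch = timeline.index("D")           # cycle of dispatch
    retire = timeline.index("R", dispatch)   # cycle of retirement
    span = timeline[dispatch:retire + 1]     # only this instruction's lifetime

    return {
        "dispatch_cycle": dispatch,
        "wait_cycles": span.count("="),         # dispatched, waiting to execute
        "retire_wait_cycles": span.count("-"),  # executed, waiting to retire
        "retire_cycle": retire,
    }

first_hadd = parse_timeline_row(
    "[0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3")
second_mul = parse_timeline_row(
    "[1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2")

print(first_hadd)
print(second_mul)
```

For the first vhaddps this reports two wait cycles after dispatch, which is the stall the question is about; for the second vmulps it reports five cycles spent waiting to retire.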

Considering the sequence above, what prevents the first vhaddps instruction from being executed in the second cycle?

I cannot see a dependency on the preceding instructions. Furthermore, the docs point out that the multiplication uses the JFPM and JFPU1 execution units, whereas the addition uses different resources, namely JFPA and JFPU0. Shouldn't the addition execute earlier?


Solution

  • vhaddps is 3 uops: 2 shuffles and a vertical add. The shuffle uops can only run on port 5, so there's a resource conflict. This is why vhaddps has a throughput of one per 2 cycles.

    (That's on Intel. AMD Zen 2 and later decode it as 4 uops for the XMM version and, strangely, 3 for the YMM version; see https://uops.info/.)

    This is why it's so bad for horizontally summing a single register; it's only useful with two different inputs.


    Also, a terminology nitpick: LLVM-MCA doesn't "measure" anything; it simulates. At best it's a loose prediction of what real hardware will do. uiCA tends to model more closely the parts of the pipeline that matter, for the CPUs it supports (Intel Sandybridge-family up to Ice Lake / Rocket Lake). Andreas Abel published a paper about it: https://www.uops.info/uiCA.html
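The throughput claim for Intel in the answer can be reproduced with back-of-envelope arithmetic. The Python sketch below is only an illustration: the function name `port_bound_cycles` is hypothetical, and the port set for the add uop is an assumption (it varies by microarchitecture). It counts only uops that are pinned to a single port, which gives a lower bound on cycles per instruction.

```python
from collections import Counter

def port_bound_cycles(uops):
    """Lower bound on cycles per instruction from uops pinned to one port.

    `uops` is a list of tuples, each naming the ports a uop may issue to.
    Uops with a choice of ports are ignored here, so this is only a bound.
    """
    pinned = Counter(ports[0] for ports in uops if len(ports) == 1)
    return max(pinned.values(), default=0)

# vhaddps on Intel as described above: two shuffles on port 5 plus one add.
# The ports for the add are an assumption made for illustration.
VHADDPS_UOPS = [("p5",), ("p5",), ("p0", "p1")]

print(port_bound_cycles(VHADDPS_UOPS))  # 2
```

Two uops competing for port 5 give a bound of 2 cycles per vhaddps, matching the one-per-2-cycles throughput stated above.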