When running toplev (from pmu-tools) on a piece of software (compiled with gcc -g -O3), I get this output:
FE Frontend_Bound: 37.21 +- 0.00 % Slots
BAD Bad_Speculation: 23.62 +- 0.00 % Slots
BE Backend_Bound: 7.33 +- 0.00 % Slots below
RET Retiring: 31.82 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 26.55 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 10.62 +- 0.00 % Slots
BAD Bad_Speculation.Branch_Mispredicts: 23.72 +- 0.00 % Slots
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 1.59 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 5.73 +- 0.00 % Slots below
RET Retiring.Base: 31.54 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.28 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.70 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.62 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.04 +- 0.00 % Clocks_Estimated <==
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.57 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.76 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.36 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 26.79 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 6.53 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.03 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.37 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 2.46 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 0.22 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.01 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.53 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
This binary takes around 4.7 seconds to run.
If I add the following flag to gcc: -falign-loops=32, the binary now takes around 3.8 seconds to run, and this is the output from toplev:
FE Frontend_Bound: 17.47 +- 0.00 % Slots below
BAD Bad_Speculation: 28.55 +- 0.00 % Slots
BE Backend_Bound: 12.02 +- 0.00 % Slots
RET Retiring: 34.21 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 6.10 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Bandwidth: 11.31 +- 0.00 % Slots below
BAD Bad_Speculation.Branch_Mispredicts: 29.19 +- 0.00 % Slots <==
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 4.58 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 7.44 +- 0.00 % Slots below
RET Retiring.Base: 33.70 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.50 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.55 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.58 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.72 +- 0.00 % Clocks_Estimated below
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.17 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.40 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.68 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 42.01 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 7.60 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.04 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.70 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 0.71 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 1.85 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.02 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 17.38 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
Adding that flag has improved the Frontend_Latency metric (as we can see in the toplev output). I understand that with the flag the loops are now aligned to 32 bytes, and the DSB is hit more frequently when running tight loops (the code spends its time mostly in a couple of small loops). However, I don't understand why the metric Frontend_Bound.Frontend_Bandwidth.DSB has gone up (the description for that metric is: "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline"). I would have expected that metric to go down, as the use of the DSB is precisely what I'm improving by adding the gcc flag.
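As far as I can tell from the TMA metrics that toplev implements, this DSB metric is computed roughly as:
Frontend_Bound.Frontend_Bandwidth.DSB =
    (IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS
i.e. the fraction of core cycles in which the DSB was active but delivered fewer than 4 uops (please correct me if I'm misreading the formula).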
PS: when running toplev I used --no-multiplex to minimize errors caused by multiplexing. The target architecture is Broadwell, and the assembly of the loops is the following (Intel syntax):
606: eb 15 jmp 61d <main+0x7d>
608: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
60f: 00
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
624: 48 8d 0c 36 lea rcx,[rsi+rsi*1]
628: 48 81 f9 00 20 00 00 cmp rcx,0x2000
62f: 77 20 ja 651 <main+0xb1>
631: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
636: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
63d: 00 00 00
640: 41 c6 04 08 00 mov BYTE PTR [r8+rcx*1],0x0
645: 48 01 f1 add rcx,rsi
648: 48 81 f9 00 20 00 00 cmp rcx,0x2000
64f: 7e ef jle 640 <main+0xa0>
Your assembly code reveals why the bandwidth DSB metric is very high: in 42.01% of all core cycles, the DSB was active but delivered fewer than 4 uops. The issue appears to be in the following loop:
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
This loop starts at a 16-byte boundary (0x610) despite -falign-loops=32 being passed to the compiler, and its body crosses the 32-byte boundary at 0x620 (the cmp at 0x61d straddles it), which means the loop's uops end up in two different cache sets in the DSB. The DSB can only deliver uops to the IDQ from one set in the same cycle, so it will deliver the add and the first cmp/je in one cycle and the second cmp/je in the next cycle. In both cycles, the DSB bandwidth is less than 4 uops.
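Laying the loop out against 32-byte windows (addresses and instruction sizes taken from the disassembly above) makes this concrete:
32-byte window 0x600-0x61f:
    610-613: add rsi,0x1
    614-61a: cmp rsi,0x2001
    61b-61c: je  5ca <main+0x2a>
    61d-621: cmp BYTE PTR [r8+rsi*1],0x0   (straddles the 0x620 boundary)
32-byte window 0x620-0x63f:
    622-623: je  610 <main+0x70>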
However, the LSD is supposed to hide such limitations, yet it seems not to be active here. The loop contains two jump instructions: the first appears to check whether the end of the array (0x2001 bytes) has been reached, and the second appears to check whether a non-zero byte-wide element has been found. A maximum trip count of 0x2001 gives ample time for the LSD to detect the loop and lock it down in the IDQ. On the other hand, if a non-zero element is usually found before the LSD detects the loop, then the uops will be delivered from either the DSB path or the MITE path; in this case, they appear to be coming from the DSB path.
And because the loop body crosses a 32-byte boundary, it takes 2 cycles to execute one iteration (compared to a best case of one cycle per iteration if the loop had been 32-byte aligned, which the backend can sustain since Broadwell has two branch execution ports). I think if you align this loop to 32 bytes, the bandwidth DSB metric will improve, not because the DSB will deliver 4 uops per cycle (it will deliver only 3 uops per cycle) but because it may take fewer cycles to execute the loop.
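As a rough sketch of what that alignment would buy (the addresses below are hypothetical; only the instruction sizes are taken from your disassembly), a 32-byte-aligned copy of the loop fits entirely in one 32-byte window, and hence in a single DSB set:
    620-623: add rsi,0x1
    624-62a: cmp rsi,0x2001
    62b-62c: je  <exit>
    62d-631: cmp BYTE PTR [r8+rsi*1],0x0
    632-633: je  620                       (all 20 bytes inside 0x620-0x63f)
In that layout the DSB could plausibly hand a whole iteration to the IDQ in one cycle instead of spreading it over two.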
Even if you somehow changed the code so that the uops were delivered from the LSD instead, you still could not do better than 1 cycle per iteration, even though the LSD in Broadwell can deliver uops across loop iterations (in contrast to the DSB, I think). That's because you would hit another bottleneck: at most two jumps can be allocated in one cycle (see "Can the LSD issue uOPs from the next iteration of the detected loop?"). So the bandwidth LSD metric would become larger while the bandwidth DSB metric would become smaller; this just moves the bottleneck, it does not improve performance (although it may improve power consumption). There is no way to improve the frontend bandwidth of this loop itself; the only real option is to move work from elsewhere into the loop so that the frontend slots it does get are filled with useful uops.
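To spell out that allocation bottleneck with a back-of-the-envelope count (assuming both cmp/je pairs macro-fuse, which is where the 3-uops-per-iteration figure above comes from):
Per iteration:   add rsi,0x1             -> 1 uop
                 cmp rsi,0x2001 / je     -> 1 uop (macro-fused), 1 jump
                 cmp BYTE PTR ... / je   -> 1 uop (macro-fused), 1 jump
                 total                      3 uops, 2 jumps
Allocation:      up to 4 uops per cycle, but at most 2 jumps per cycle
=> a ceiling of about 1 iteration per cycle, whether the IDQ is fed by the LSD or the DSB.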
For information on the LSD, see "Why jnz requires 2 cycles to complete in an inner loop".