When running toplev (from pmu-tools) on a piece of software (compiled with gcc -g -O3), I get this output:
FE Frontend_Bound: 37.21 +- 0.00 % Slots
BAD Bad_Speculation: 23.62 +- 0.00 % Slots
BE Backend_Bound: 7.33 +- 0.00 % Slots below
RET Retiring: 31.82 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 26.55 +- 0.00 % Slots
FE Frontend_Bound.Frontend_Bandwidth: 10.62 +- 0.00 % Slots
BAD Bad_Speculation.Branch_Mispredicts: 23.72 +- 0.00 % Slots
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 1.59 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 5.73 +- 0.00 % Slots below
RET Retiring.Base: 31.54 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.28 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.70 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.62 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.04 +- 0.00 % Clocks_Estimated <==
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.57 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.76 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.36 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 26.79 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 6.53 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.03 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.37 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 2.46 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 0.22 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.01 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 28.53 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
This binary takes around 4.7 seconds to run.
If I add the following flag to gcc: -falign-loops=32, the binary now takes around 3.8 seconds to run, and this is the output from toplev:
FE Frontend_Bound: 17.47 +- 0.00 % Slots below
BAD Bad_Speculation: 28.55 +- 0.00 % Slots
BE Backend_Bound: 12.02 +- 0.00 % Slots
RET Retiring: 34.21 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency: 6.10 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Bandwidth: 11.31 +- 0.00 % Slots below
BAD Bad_Speculation.Branch_Mispredicts: 29.19 +- 0.00 % Slots <==
BAD Bad_Speculation.Machine_Clears: 0.01 +- 0.00 % Slots below
BE/Mem Backend_Bound.Memory_Bound: 4.58 +- 0.00 % Slots below
BE/Core Backend_Bound.Core_Bound: 7.44 +- 0.00 % Slots below
RET Retiring.Base: 33.70 +- 0.00 % Slots below
RET Retiring.Microcode_Sequencer: 0.50 +- 0.00 % Slots below
FE Frontend_Bound.Frontend_Latency.ICache_Misses: 0.55 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.ITLB_Misses: 0.58 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.Branch_Resteers: 5.72 +- 0.00 % Clocks_Estimated below
FE Frontend_Bound.Frontend_Latency.DSB_Switches: 0.17 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.LCP: 0.00 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Latency.MS_Switches: 0.40 +- 0.00 % Clocks below
FE Frontend_Bound.Frontend_Bandwidth.MITE: 0.68 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.DSB: 42.01 +- 0.00 % CoreClocks below
FE Frontend_Bound.Frontend_Bandwidth.LSD: 0.00 +- 0.00 % CoreClocks below
BE/Mem Backend_Bound.Memory_Bound.L1_Bound: 7.60 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L2_Bound: -0.04 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.L3_Bound: 0.70 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.DRAM_Bound: 0.71 +- 0.00 % Stalls below
BE/Mem Backend_Bound.Memory_Bound.Store_Bound: 1.85 +- 0.00 % Stalls below
BE/Core Backend_Bound.Core_Bound.Divider: 0.02 +- 0.00 % Clocks below
BE/Core Backend_Bound.Core_Bound.Ports_Utilization: 17.38 +- 0.00 % Clocks below
RET Retiring.Base.FP_Arith: 0.02 +- 0.00 % Uops below
RET Retiring.Base.Other: 99.98 +- 0.00 % Uops below
RET Retiring.Microcode_Sequencer.Assists: 0.00 +- 0.00 % Slots_Estimated below
MUX: 100.00 +- 0.00 %
warning: 6 results not referenced: 67 71 72 85 87 88
Adding that flag has improved the Frontend_Latency metric (as we can see in the toplev output). I understand that with the flag the loops are now aligned to 32 bytes, and the DSB is hit more frequently when running tight loops (the code spends its time mostly in a couple of small loops). However, I don't understand why the metric Frontend_Bound.Frontend_Bandwidth.DSB has gone up (the description for that metric is: "This metric represents Core fraction of cycles in which CPU was likely limited due to DSB (decoded uop cache) fetch pipeline"). I would have expected that metric to go down, as the use of the DSB is precisely what I'm improving by adding the gcc flag.
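As far as I can tell from the TMA metrics that toplev implements, this DSB metric is computed roughly as:
Frontend_Bound.Frontend_Bandwidth.DSB =
    (IDQ.ALL_DSB_CYCLES_ANY_UOPS - IDQ.ALL_DSB_CYCLES_4_UOPS) / CORE_CLKS
i.e. the fraction of core cycles in which the DSB was active but delivered fewer than 4 uops (please correct me if I'm misreading the formula).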
PS: when running toplev I used --no-multiplex to minimize errors caused by multiplexing. The target architecture is Broadwell, and the assembly of the loops is the following (Intel syntax):
606: eb 15 jmp 61d <main+0x7d>
608: 0f 1f 84 00 00 00 00 nop DWORD PTR [rax+rax*1+0x0]
60f: 00
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
624: 48 8d 0c 36 lea rcx,[rsi+rsi*1]
628: 48 81 f9 00 20 00 00 cmp rcx,0x2000
62f: 77 20 ja 651 <main+0xb1>
631: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
636: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
63d: 00 00 00
640: 41 c6 04 08 00 mov BYTE PTR [r8+rcx*1],0x0
645: 48 01 f1 add rcx,rsi
648: 48 81 f9 00 20 00 00 cmp rcx,0x2000
64f: 7e ef jle 640 <main+0xa0>
Your assembly code reveals why the bandwidth DSB metric is very high: in 42.01% of all core cycles, the DSB was active but delivered fewer than 4 uops. The issue appears to be in the following loop:
610: 48 83 c6 01 add rsi,0x1
614: 48 81 fe 01 20 00 00 cmp rsi,0x2001
61b: 74 ad je 5ca <main+0x2a>
61d: 41 80 3c 30 00 cmp BYTE PTR [r8+rsi*1],0x0
622: 74 ec je 610 <main+0x70>
This loop starts at a 16-byte boundary (0x610) despite -falign-loops=32 being passed to the compiler, and its body crosses the 32-byte boundary at 0x620 (the cmp at 0x61d straddles it), which means the loop's uops end up in two different cache sets in the DSB. The DSB can only deliver uops to the IDQ from one set in the same cycle, so it will deliver the add and the first cmp/je in one cycle and the second cmp/je in the next cycle. In both cycles, the DSB bandwidth is less than 4 uops.
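Laying the loop out against 32-byte windows (addresses and instruction sizes taken from the disassembly above) makes this concrete:
32-byte window 0x600-0x61f:
    610-613: add rsi,0x1
    614-61a: cmp rsi,0x2001
    61b-61c: je  5ca <main+0x2a>
    61d-621: cmp BYTE PTR [r8+rsi*1],0x0   (straddles the 0x620 boundary)
32-byte window 0x620-0x63f:
    622-623: je  610 <main+0x70>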
However, the LSD is supposed to hide such limitations, yet it seems not to be active here. The loop contains two jump instructions: the first appears to check whether the end of the array (0x2001 bytes) has been reached, and the second appears to check whether a non-zero byte-wide element has been found. A maximum trip count of 0x2001 gives ample time for the LSD to detect the loop and lock it down in the IDQ. On the other hand, if a non-zero element is usually found before the LSD detects the loop, then the uops will be delivered from either the DSB path or the MITE path; in this case, they appear to be coming from the DSB path.
And because the loop body crosses a 32-byte boundary, it takes 2 cycles to execute one iteration (compared to a best case of one cycle per iteration if the loop had been 32-byte aligned, which the backend can sustain since Broadwell has two branch execution ports). I think if you align this loop to 32 bytes, the bandwidth DSB metric will improve, not because the DSB will deliver 4 uops per cycle (it will deliver only 3 uops per cycle) but because it may take fewer cycles to execute the loop.
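As a rough sketch of what that alignment would buy (the addresses below are hypothetical; only the instruction sizes are taken from your disassembly), a 32-byte-aligned copy of the loop fits entirely in one 32-byte window, and hence in a single DSB set:
    620-623: add rsi,0x1
    624-62a: cmp rsi,0x2001
    62b-62c: je  <exit>
    62d-631: cmp BYTE PTR [r8+rsi*1],0x0
    632-633: je  620                       (all 20 bytes inside 0x620-0x63f)
In that layout the DSB could plausibly hand a whole iteration to the IDQ in one cycle instead of spreading it over two.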
Even if you somehow changed the code so that the uops were delivered from the LSD instead, you still could not do better than 1 cycle per iteration, even though the LSD in Broadwell can deliver uops across loop iterations (in contrast to the DSB, I think). That's because you would hit another bottleneck: at most two jumps can be allocated in one cycle (see "Can the LSD issue uOPs from the next iteration of the detected loop?"). So the bandwidth LSD metric would become larger while the bandwidth DSB metric would become smaller; this just moves the bottleneck, it does not improve performance (although it may improve power consumption). There is no way to improve the frontend bandwidth of this loop itself; the only real option is to move work from elsewhere into the loop so that the frontend slots it does get are filled with useful uops.
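To spell out that allocation bottleneck with a back-of-the-envelope count (assuming both cmp/je pairs macro-fuse, which is where the 3-uops-per-iteration figure above comes from):
Per iteration:   add rsi,0x1             -> 1 uop
                 cmp rsi,0x2001 / je     -> 1 uop (macro-fused), 1 jump
                 cmp BYTE PTR ... / je   -> 1 uop (macro-fused), 1 jump
                 total                      3 uops, 2 jumps
Allocation:      up to 4 uops per cycle, but at most 2 jumps per cycle
=> a ceiling of about 1 iteration per cycle, whether the IDQ is fed by the LSD or the DSB.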
For information on the LSD, see "Why jnz requires 2 cycles to complete in an inner loop".