Tags: performance, caching, x86, cpu-architecture, amd-processor

Why is my loop much faster when it is contained in one cache line?


When I run this little assembly program on my Ryzen 9 3900X:

_start:   xor       rax, rax
          xor       rcx, rcx
loop0:    add       rax, 1
          mov       rdx, rax
          and       rdx, 1
          add       rcx, rdx
          cmp       rcx, 1000000000
          jne       loop0

It completes in 450 ms if all the instructions from loop0 up to and including the jne are contained entirely in one cache line. That is, if:

floor((address of loop0)/64) == floor((address of end of jne instruction)/64)

However, if the above condition does not hold, the loop takes 900 ms instead.

I've made a repo with the code https://github.com/avl/strange_performance_repro .

Why is the inner loop much slower in some specific cases?

Edit: Removed a claim with a conclusion from a mistake in testing.


Solution

  • Your issue lies in the variable cost of the jne instruction.

    First of all, to understand the impact of the effect, we need to analyze the full loop itself. The architecture of the Ryzen 9 3900X is Zen2. We can find information about it on the AMD website or on WikiChip. This architecture has 4 ALUs and 3 AGUs, which roughly means it can execute up to 4 instructions like add/and/cmp per cycle.

    Here is the cost of each instruction of the loop (based on the Agner Fog instruction table for Zen1):

                                            # Reciprocal throughput
    loop0:    add       rax, 1              # 0.25
              mov       rdx, rax            # 0.2
              and       rdx, 1              # 0.25
              add       rcx, rdx            # 0.25
              cmp       rcx, 1000000000     # 0.25  | Fused  0.5-2 (2 if jumping)
              jne       loop0               # 0.5-2 |
    

    As you can see, the first 4 computing instructions of the loop can be executed in ~1 cycle. The last 2 instructions can be fused by your processor into a single faster one. Your main issue is that this jne instruction can be quite slow compared to the rest of the loop, so you are very likely measuring only the overhead of this instruction. From this point, things start to get complicated.

    Engineers and researchers have worked hard over the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the instructions the jne depends on as soon as possible. Most processors can also predict the address of the next instruction to execute after the jne and fetch new instructions (e.g. those of the next loop iteration) while the other instructions of the current iteration are still executing. Performing the fetch as soon as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).

    From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:

    • 2.8.3 Loop Alignment:

      For the processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible.

    • 2.8.1.1 Next Address Logic

      The next-address logic determines addresses for instruction fetch. [...]. When branches are identified, the next-address logic is redirected by the branch target and branch direction prediction hardware to generate a non-sequential fetch block address. The processor facilities that are designed to predict the next instruction to be executed following a branch are detailed in the following sections.

    Thus, a conditional branch to instructions located in another cache line introduces an additional latency overhead, due to an extra fetch from the op cache (a decoded-instruction cache faster than the L1) that is not needed when the whole loop fits in one cache line. Indeed, if the loop crosses a cache-line boundary, 2 cache-line fetches are required, which takes no less than 2 cycles. If the whole loop fits in one cache line, only 1 fetch is needed, which takes only 1 cycle. Since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?

    You say that the program takes about 450 ms. Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and the loop body executes 2e9 times (rcx only advances every other iteration), the number of cycles per loop iteration is about 1.035. Note that this is better than we could have expected (I guess this processor is able to rename registers, eliminate the mov instruction, and execute the jne in parallel in only 1 cycle while the other instructions of the loop are perfectly pipelined; which is astounding). It also shows that paying an additional fetch overhead of 1 cycle doubles the number of cycles needed per loop iteration, and thus the overall execution time.

    If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.

    This problem can occur on other architectures, such as Intel Skylake. Indeed, the same loop on an i5-9600KF takes 0.70 s with loop alignment and 0.90 s without (also due to the additional 1-cycle fetch latency). With 8x unrolling, the result is 0.53 s regardless of alignment.