Tags: performance, assembly, x86, micro-optimization, micro-architecture

Why doesn't jnz take a cycle?


I found in an online resource that IvyBridge has 3 ALUs, so I wrote a small program to test it:

global _start
_start:
    mov rcx,    10000000
.for_loop:              ; do {
    inc rax
    inc rbx
    dec rcx
    jnz .for_loop       ; } while (--rcx)

    xor rdi,    rdi
    mov rax,    60      ; _exit(0)
    syscall

I assemble, link, and run it under perf:

$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out

The output shows:

10,491,664      cycles

which seems to make sense at first glance, because there are 3 independent instructions (2 inc and 1 dec) that use an ALU in the loop, so together they should take 1 cycle per iteration.

But what I don't understand is why the whole loop takes only 1 cycle per iteration. jnz depends on the result of dec rcx, so it should take a cycle of its own, making the whole loop 2 cycles per iteration. I would expect the output to be close to 20,000,000 cycles.

I also tried changing the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does become close to 20,000,000 cycles, which shows that a data dependency delays an instruction so the two can't run in the same cycle. So why is jnz special?
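For reference, the loop body with that change (the second inc turned into another inc rax) looks like this:

.for_loop:              ; do {
    inc rax
    inc rax             ; depends on the inc rax above: loop-carried chain
    dec rcx
    jnz .for_loop       ; } while (--rcx)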

What am I missing here?


Solution

  • First of all, dec/jnz will macro-fuse into a single uop on Intel Sandybridge-family. You could defeat that by putting a non-flag-setting instruction between the dec and jnz.

    .for_loop:              ; do {
        inc rax
        dec rcx
        lea rbx, [rbx+1]    ; doesn't touch flags, defeats macro-fusion
        jnz .for_loop       ; } while (--rcx)
    

    This will still run at 1 iteration per cycle on Haswell and later, and on Ryzen, because they have 4 integer execution ports and can keep up with 4 uops per iteration. (Your loop with macro-fusion is only 3 fused-domain uops on Intel CPUs, so SnB/IvB can run it at 1 per clock, too.)
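    To spell that out (my annotation, not part of the original answer), the fused-domain uop count per iteration of the original loop on Sandybridge-family looks like this:

        inc rax             ; 1 uop
        inc rbx             ; 1 uop
        dec rcx             ; \ macro-fuse into a single dec-and-branch uop
        jnz .for_loop       ; /                        => 3 fused-domain uops

    The lea variant above defeats the fusion, making it 4 uops per iteration, which is why it needs the 4 integer ports of Haswell / Ryzen to still sustain 1 iteration per clock.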

    See Agner Fog's optimization guide, and especially his microarch guide. Also see the other links in https://stackoverflow.com/tags/x86/info.


    Control dependencies are hidden by branch prediction + speculative execution, unlike data dependencies.

    Out-of-order execution and branch prediction + speculative execution hide the "latency" of the control dependency, i.e. the next iteration can start running before the CPU verifies that jnz should really be taken.

    So each jnz has an input dependency on the previous dec rcx before it can verify the prediction, but later instructions don't have to wait for it to be checked before they can execute. In-order retirement makes sure that mis-speculation is caught before anything can "see" it happen (except for microarchitectural effects leading to the Spectre attack...)
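
    If you want to watch the predictor at work, perf's generic branch events are enough (a suggested command, not part of the measurement above):

        $ perf stat -e cycles,instructions,branches,branch-misses ./a.out

    You should see roughly one branch per iteration but only a handful of branch-misses (essentially just the final, not-taken loop exit), which is why the control dependency adds essentially nothing per iteration.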


    10M iterations is not a lot. I'd normally use at least 100M for something that runs at only 1c per iter. Having a simple microbenchmark run for 0.1 to 1 second is normally good to get very high precision and hide startup overhead.
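
    For example (ballpark math, assuming ~3 GHz and 1 cycle per iteration; neither number is from the original post), something like this lands in that range:

        mov rcx,    1000000000  ; ~1e9 iterations ≈ 0.3 seconds at ~3 GHz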

    And BTW, you don't need sudo perf if you set kernel.perf_event_paranoid = 0 with sysctl. It's almost certainly better to do that than to use sudo all the time.
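
    For example, on a typical Linux setup (a one-off change with sysctl -w; put the setting in /etc/sysctl.conf or a file under /etc/sysctl.d/ to make it persistent):

        $ sudo sysctl -w kernel.perf_event_paranoid=0
        $ perf stat ./a.out      # no sudo needed any more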