assembly x86 cpu-architecture microbenchmark

Assembly `jmp rel8` vs `jmp rel32` performance

I checked the uops.info table (https://uops.info/table.html) and found that the throughput (TP) figure for jmp rel8 is much higher than for jmp rel32. Does this mean that jmp rel8 is slower than jmp rel32?

jmp rel32

With unroll_count=500 and no inner loop

    Code:

       0:   e9 00 00 00 00          jmp    0x5

    Results:
        Instructions retired: 1.0
        Core cycles: 2.75
        Reference cycles: 2.05

jmp rel8

With unroll_count=500 and no inner loop

    Code:

       0:   eb 00                   jmp    0x2

    Results:
        Instructions retired: 1.0
        Core cycles: 5.84
        Reference cycles: 4.61

Solution

That's not a very representative measurement. A throughput of one taken branch per 2 cycles is normal, or one per clock for loop branches in tiny loops. But depending on the microarchitecture, branch prediction can do worse when there are more branches per 16-byte block of code, so densely packing jmp next_instruction (jmp rel8 with a displacement of 0) is bad, especially when you put 500 of them in a row, as in "Slow jmp-instruction".
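To make the branch-density point concrete, here is a small sketch in Python that counts how many back-to-back jumps begin in each aligned 16-byte block. It uses the two encodings from the disassembly in the question and the unroll count of 500. The helper name `jumps_starting_per_block` is made up for illustration, and the sketch assumes the unrolled code starts at a 16-byte-aligned address, which the uops.info output doesn't state. It only models code layout, not how any real branch predictor behaves.

```python
# Encodings copied from the disassembly in the question.
JMP_REL8 = bytes([0xEB, 0x00])                     # jmp +0, 2 bytes
JMP_REL32 = bytes([0xE9, 0x00, 0x00, 0x00, 0x00])  # jmp +0, 5 bytes

BLOCK = 16    # aligned block size discussed in the answer
UNROLL = 500  # unroll_count from the uops.info runs


def jumps_starting_per_block(insn, count=UNROLL, block=BLOCK):
    """Count how many back-to-back copies of insn start in each aligned block.

    Assumption: the first copy sits at offset 0 of an aligned block.
    """
    total_bytes = count * len(insn)
    n_blocks = (total_bytes + block - 1) // block  # round up
    counts = [0] * n_blocks
    for i in range(count):
        start_offset = i * len(insn)
        counts[start_offset // block] += 1
    return counts


if __name__ == "__main__":
    for name, insn in (("jmp rel8", JMP_REL8), ("jmp rel32", JMP_REL32)):
        counts = jumps_starting_per_block(insn)
        print(f"{name}: {len(insn) * UNROLL} bytes total, "
              f"up to {max(counts)} jumps start in one {BLOCK}-byte block")
```

Under those assumptions, the rel8 version packs 8 jumps into every 16-byte block, while the rel32 version never has more than 4 starting in one block, so the rel8 test stresses per-block branch tracking about twice as hard.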

That 5.84 number appears to be from Alder Lake P-cores. uops.info reports different numbers for other microarchitectures; for something this low-level, it matters a lot which one you look at.

Anyway, I think the key point here is that https://uops.info/ doesn't benchmark taken jumps very well: it uses the same test harness as for other instructions (unrolling the instruction many times), which gives poor results that don't characterize jump throughput in real code.

Agner Fog's instruction tables (https://agner.org/optimize/) report different numbers, e.g. 1-2 cycle throughput for relative jmp on Skylake and Ice Lake, the same as on most earlier Intel CPUs. That's realistic if you have jumps inside a loop, so that the same few jump instructions execute over and over.

But uops.info measured 2.12 or 4.80 cycles for Skylake, which is much higher and something you will hopefully only run into in artificial microbenchmarks.