Tags: x86, x86-64, cpu-architecture, micro-optimization, branch-prediction

Cost of a 64-bit jump: always 10-22 cycles the first time?


In x86-64 there is no direct jump with a 64-bit absolute address, only a 32-bit relative one. With indirect jumps, I understand the pipeline HAS TO BE RESOLVED ONCE before branch prediction comes into play. My question is: is there no way in 64-bit mode to do a 1-3 cycle jump on the first execution?


Solution

  • Direct jumps aren't always that cheap "the first time", even without I-cache misses. They still need branch prediction.


    In long mode, jcc rel32 and jmp rel32 (and the compact rel8 versions) use a sign-extended relative displacement from RIP. You can jump to any 64-bit address, as long as the target is within ±2GiB of the end of the jump instruction. So keep your code within 2GiB of other code so you can always use rel32 displacements.
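    The reachability rule above is just arithmetic on the displacement. A minimal sketch (the addresses are made up for illustration): the rel32 field is the target minus the address of the instruction *after* the 5-byte `jmp rel32`, and it must fit in a sign-extended 32 bits.

    ```c
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        // Hypothetical addresses, for illustration only.
        uint64_t jmp_addr = 0x400000;          // address of the E9 opcode byte
        uint64_t target   = 0x401000;          // desired branch destination
        // jmp rel32 (opcode E9) is 5 bytes; the displacement is relative
        // to RIP after the jmp, i.e. jmp_addr + 5.
        int64_t disp = (int64_t)(target - (jmp_addr + 5));
        // Encodable only if the displacement fits in a sign-extended int32.
        int reachable = disp >= INT32_MIN && disp <= INT32_MAX;
        printf("disp=%lld reachable=%d\n", (long long)disp, reachable);
        return 0;
    }
    ```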

    There are no absolute direct jumps in long mode. 32-bit mode's far JMP ptr16:32 (opcode 0xEA) and far CALL ptr16:32 don't have 64-bit versions at all. (And you don't want a far jmp anyway, for performance and convenience.) Instructions like SYSCALL and INT are indirect jumps (with an implicit destination), and aren't useful here either.
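    So reaching an arbitrary 64-bit absolute address takes an indirect branch, e.g. `mov rax, imm64` / `jmp rax` in asm. At the C level, calling through a function pointer gives you the same thing: the pointer holds a full 64-bit address and the compiler emits an indirect call. A minimal sketch (the `volatile` is just to stop the compiler from devirtualizing the call):

    ```c
    #include <stdio.h>

    static int handler(void) { return 7; }

    // A function pointer holds a full 64-bit absolute address; calling
    // through it compiles to an indirect call (e.g. `call rax`), which is
    // how you reach an arbitrary 64-bit address in one branch.
    static int (*volatile dispatch)(void) = handler;

    int main(void) {
        printf("%d\n", dispatch());  // indirect call through a 64-bit pointer
        return 0;
    }
    ```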


    There's also no instruction-prefetch/predecode instruction to get the target hot in L1 I-cache or the uop cache, or any way to hint the pipeline that decoded instructions from a given address will be needed soon.

    See the PREDECODE wishlist section in Darek Mihocka's article about indirect jumps in emulators, where it's useful to have the handler for one guest instruction jump right to the handler for the next guest instruction, instead of having one indirect-call dispatch instruction that will almost always mispredict. (Or at least it was useful when Mihocka wrote that, before IT-TAGE branch predictors more or less solved the problem (Intel Haswell and later, AMD Zen or Zen 2): Branch Prediction and the Performance of Interpreters - Don't Trust Folklore (2015) by Rohou, Swamy, and Seznec.)
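    The replicated-dispatch idea Mihocka describes can be sketched with GNU C's computed-goto extension (a toy bytecode, not his actual code): each handler ends with its own copy of the indirect jump, so each copy gets its own BTB entry and can be predicted from the context of the previous opcode, instead of all dispatches funnelling through one hard-to-predict branch.

    ```c
    // Requires GCC or Clang (uses the computed-goto / labels-as-values extension).
    #include <stdio.h>

    // Toy bytecode: 0 = add 1, 1 = add 10, 2 = halt.
    int run(const unsigned char *prog) {
        // One handler table; the indirect dispatch jump is *replicated* at
        // the end of every handler rather than centralized in a loop.
        static void *handlers[] = { &&op_inc, &&op_add10, &&op_halt };
        int acc = 0;
        const unsigned char *pc = prog;
        goto *handlers[*pc++];
    op_inc:
        acc += 1;
        goto *handlers[*pc++];
    op_add10:
        acc += 10;
        goto *handlers[*pc++];
    op_halt:
        return acc;
    }

    int main(void) {
        unsigned char prog[] = { 0, 1, 0, 2 };  // 1 + 10 + 1 = 12
        printf("%d\n", run(prog));
        return 0;
    }
    ```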


    Direct jumps

    Even direct jumps need the branch target buffer (BTB) to predict that the next fetch block should come from somewhere else. This information is needed much earlier than the decode stage, so it has to be predicted to avoid significant front-end bubbles. An interesting question brought up this issue recently: Slow jmp-instruction. The replies on the Realworldtech forum thread make it clear that branch prediction needs to work on fetch blocks, not just individual instructions, and that even on an easy-to-decode fixed-width ISA (unlike x86), you need prediction earlier than decode results can be available.


    1-3 cycles is unrealistic for the size of the code-fetch bubble for a newly-seen direct (rel32) jump. Part of that bubble may be hidden by the decoded-uop queue, though.

    Code-fetch to decode is probably at least 5 or 6 cycles, and likely more. Let's say L1-I hit time is 4 cycles, same as Haswell's L1D load-use latency. Then Intel CPUs pre-decode to mark instruction boundaries, and then the decode stage decodes up to 4 uops. David Kanter's Haswell writeup has a diagram of the frontend.

    The OP's data from the Slow jmp-instruction question indicates that a huge block of nothing but JMP instructions runs at about one JMP per 12 clocks on Intel Broadwell (each jmp targeting the next instruction), so that's your worst-case scenario where fetch/decode bubbles can't be hidden at all, because you're not doing anything else that gives the frontend time to catch up.
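    That experiment can be reproduced in miniature by JITing a run of `jmp rel32` instructions into an executable buffer at runtime. A hedged sketch for x86-64 Linux (not the OP's actual benchmark; wrap the call in your favorite timer, e.g. rdtsc, to see the per-jmp cost):

    ```c
    // x86-64 Linux only: fill an executable buffer with `jmp rel32` to the
    // next instruction, then end with mov eax, 42 / ret so we can verify
    // the generated code actually ran.
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        enum { NJMP = 10000 };
        size_t len = NJMP * 5 + 6;
        unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        unsigned char *p = buf;
        for (int i = 0; i < NJMP; i++) {
            *p++ = 0xE9;                 // jmp rel32 ...
            memset(p, 0, 4); p += 4;     // ... displacement 0: next insn
        }
        *p++ = 0xB8; *p++ = 42;          // mov eax, 42
        *p++ = 0; *p++ = 0; *p++ = 0;
        *p++ = 0xC3;                     // ret
        mprotect(buf, len, PROT_READ | PROT_EXEC);
        int (*fn)(void) = (int (*)(void))buf;  // data->function cast: non-ISO but
        printf("%d\n", fn());                  // standard practice for JITs
        return 0;
    }
    ```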

    I'm assuming we're talking about running from the legacy decoders. A BTB miss while running from the uop cache might be slightly shorter, since the decoded uop is available faster. If the branch target also hits in the uop cache, that's also fewer cycles before decoded uops can start entering the decoded uop queue (the same buffer that's used as a loop buffer).

    If the decoded-uop queue doesn't empty during the code-fetch bubble, then there might not be any bubble in the issue stage (sending uops into the out-of-order part of the CPU).

    Or if the OOO part has a lot of un-executed uops to work on (i.e. the CPU is executing some code with bottlenecks that limit IPC to much less than front-end bandwidth), a front-end bubble might not affect it too much.


    Indirect branches are worse, though. The correct target isn't known until the jmp uop executes in the back-end to check the prediction, a few cycles later at best. Recovering from a mispredict involves rolling back any independent work from the wrong path that got executed, unlike re-steering the front-end before any wrong-path instructions/uops even get issued.

    Your basic premise is correct: indirect branches are not cheap, and should be avoided when possible. (Although one indirect branch can be cheaper than a short chain of conditional branches, e.g. in this example.)
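    To make the trade-off concrete, here is a hedged sketch (my own toy example, not the one linked above): a chain of compare-and-branch tests versus a dense `switch`, which compilers typically lower to a single indirect `jmp` through a jump table. Both compute the same result; which one is faster depends on how predictable the input distribution is.

    ```c
    #include <stdio.h>

    // Chain of conditional branches: up to three taken/not-taken tests.
    static int classify_chain(int x) {
        if (x == 0) return 10;
        else if (x == 1) return 20;
        else if (x == 2) return 30;
        else return -1;
    }

    // Dense switch: compilers often emit one indirect jmp through a
    // jump table, replacing the whole chain with a single branch.
    static int classify_switch(int x) {
        switch (x) {
        case 0: return 10;
        case 1: return 20;
        case 2: return 30;
        default: return -1;
        }
    }

    int main(void) {
        for (int x = 0; x < 4; x++)
            printf("%d %d\n", classify_chain(x), classify_switch(x));
        return 0;
    }
    ```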

