
Performance of assembly function with multiple RET

Does a function such as this have a negative effect on performance?

fn:
cmp rdi, 0
je lbl0
...
ret
lbl0:
...
ret

call fn

And this one?

fn0:
... ; no ret, fall through
fn1:
...
ret

Solution

Fall through is the most efficient thing you can do; it's just normal execution. The CPU can't even know the difference between "2 different functions" vs. labels within a function; it's all just machine code. Labels are zero-width, and just give you a way to refer to that address from elsewhere.

From a high level, you could look at it as an optimized tail call of the 2nd function: first you replace call fn1 / ret with jmp fn1, and then you optimize away the jmp entirely, because fn1 starts at the very next instruction and a jmp +0 (a jump to the next instruction) is architecturally a nop.
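The progression described above can be sketched in NASM syntax as follows. This is only an illustration: the instruction bodies are placeholders, and the names fn0_call and fn0_jmp are hypothetical, introduced to keep the three variants apart.

```nasm
; All three variants return 2 * (rdi + 1) in rax.
; fn1 is a leaf function here, so stack alignment at the call is not a concern.

; Variant A: ordinary call, then return.
fn0_call:
    add  rdi, 1             ; placeholder for fn0's own work
    call fn1
    ret

; Variant B: tail call. fn1's ret returns straight to fn0_jmp's caller.
fn0_jmp:
    add  rdi, 1
    jmp  fn1

; Variant C: fn1 is placed directly after fn0, so the jmp would be a
; jump to the next instruction and can simply be dropped.
fn0:
    add  rdi, 1
    ; no ret, fall through
fn1:
    lea  rax, [rdi + rdi]   ; placeholder for fn1's work
    ret
```

Variant C only works if fn0 has not changed the stack pointer relative to function entry by the time it reaches fn1, the same requirement as for any tail call.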

As for the first one, that's called "tail duplication" optimization, where multiple paths out of a function duplicate any necessary cleanup (pop rbx or whatever) and a ret, instead of running an extra jmp to reach a single copy of the cleanup.
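As a rough sketch of the difference, assuming a function that saves rbx and does some placeholder work on each path (the names fn_single_exit and fn_tail_dup are hypothetical):

```nasm
; Single exit: the non-zero path runs an extra jmp to reach the shared cleanup.
fn_single_exit:
    push rbx
    mov  rbx, rdi
    test rdi, rdi
    jz   .zero
    lea  rax, [rbx + rbx]   ; placeholder work for the non-zero path
    jmp  .done              ; extra dynamic instruction on this path
.zero:
    mov  eax, 1             ; placeholder work for the zero path
.done:
    pop  rbx
    ret

; Tail-duplicated: each path carries its own copy of the cleanup and ret.
fn_tail_dup:
    push rbx
    mov  rbx, rdi
    test rdi, rdi
    jz   .zero
    lea  rax, [rbx + rbx]
    pop  rbx
    ret
.zero:
    mov  eax, 1
    pop  rbx
    ret
```

Labels starting with a dot are NASM local labels, scoped to the preceding non-local label, so .zero can be reused in both functions.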

Tail duplication costs code footprint (static code size) but results in fewer dynamic instructions executed per call. It doesn't generally hurt branch prediction: ret is predicted by a stack-like predictor (the return-address stack) that pairs each ret with a call, i.e. it assumes that ret will return to the instruction after the most recent call that hasn't returned yet. As long as that holds (which it does here), you don't have a problem. You have multiple ways out of the function, but exactly one of them runs for each call to it.

You can also do loop tail duplication, where you branch inside the loop and each side of the branch separately has its own dec ecx / jnz .top_of_loop (with any jmp needed to rejoin the paths placed outside the loop, where it runs once instead of every iteration).
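A minimal sketch of loop tail duplication, under an assumed custom register convention (rdi points to a dword array, ecx holds a non-zero element count). The function names and the work done on each side are hypothetical placeholders.

```nasm
; Both versions return in eax: (count of non-negative elements) - (count of negative elements).

; Single loop tail: one side needs a jmp on every iteration to reach it.
sign_balance_single:
    xor  eax, eax
.top_of_loop:
    cmp  dword [rdi], 0
    jl   .negative
    add  eax, 1             ; work for the non-negative side
    jmp  .next              ; extra jmp inside the loop
.negative:
    sub  eax, 1             ; work for the negative side
.next:
    add  rdi, 4
    dec  ecx
    jnz  .top_of_loop
    ret

; Duplicated loop tail: each side has its own pointer increment and dec / jnz.
sign_balance_dup:
    xor  eax, eax
.top_of_loop:
    cmp  dword [rdi], 0
    jl   .negative
    add  eax, 1
    add  rdi, 4
    dec  ecx
    jnz  .top_of_loop
    ret                     ; loop exit from the non-negative side
.negative:
    sub  eax, 1
    add  rdi, 4
    dec  ecx
    jnz  .top_of_loop
    ret                     ; loop exit from the negative side
```

In this sketch the code after the loop is just a ret, so it is duplicated as well. If the code after the loop were longer, one side would typically exit with a single jmp to a shared copy, which runs once per call rather than once per iteration.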