I am familiar with two basic strategies for structuring the function prologue/epilogue:
sub rsp, n
), do the actual work, then move the stack pointer back (add rsp, n
) and ret
. (If there are many registers used in the body, there may additionally be some pushing and popping here.)With strategy 2, you can't call functions inside the body, unless you move the stack pointer where it is supposed to be, which defeats the savings, which is why it's usually only used for leaf functions.
But it occurs to me that there is a third strategy one could use:
mov rsi, AFTER; jump FUNCTION; AFTER:
for the call sequence, and in the function just jump rsi
at the end.In this method, we ignore the stack pointer completely, so we have no stack space, but for a small function that might be doable. It also requires a custom calling convention, but compilers can do that if they want to for internal functions.
Since it pairs jump
with jump
, it doesn't touch the return stack so the branch predictor should not be thrown off (although the indirect jump at the end might be slower than a return), and there is no overhead for the memory writes incurred by call
. Additionally, stackless functions can call other stackless functions (although not too much since you eventually run out of registers in which to store the return addresses, and there is a global optimization problem in ensuring that if A calls B then they use different return registers).
My question is: why don't compilers use method (3) more? AFAICT it doesn't ever show up when looking at functions compiled by gcc or clang. Is there a reason this calling convention is not usable?
Here's an attempt to benchmark both options.
.text
.align 8
subroutine:
inc %rdx
#ifdef REG_CALL
jmp *%rsi
#else
ret
#endif
reps = 1000000
.global main
main:
push %rbp
mov $reps, %ecx
xor %edx, %edx
.align 8
top:
.rept 1000
#ifdef REG_CALL
lea 0f(%rip), %rsi
jmp subroutine
0:
#else
call subroutine
#endif
.endr
dec %ecx
jnz top
lea format(%rip), %rdi
mov %rdx, %rsi
xor %eax, %eax
call printf
xor %eax, %eax
pop %rbp
ret
.data
format: .asciz "%ld calls done\n"
It does 1000 calls to the subroutine from varying return addresses, repeated one million times. Assemble with no options for traditional call/ret
, with -DREG_CALL
for your indirect jump proposal.
On an i7-8565U CPU @ 1.80GHz, traditional takes 1.6 seconds and REG_CALL
takes about 3.2. So your proposal seems to be about twice as slow.
As I mentioned in comments, I suspect the indirect branch predictor can't keep track of where the jmp *%rsi
is going to go.
Aside from runtime inefficiencies, Raymond Chen mentions another major disadvantage of this strategy in the comments:
Another problem is register assignment if one stackless function calls another. Not only must they use separate return address registers, their other register usages cannot conflict either. This would be practical only for small functions with few callers, at which point you may as well just inline them.