Search code examples
assemblyx86compiler-constructioncalling-conventionmicro-optimization

jump/jump compilation strategy for stackless functions. (Using a link register manually instead of call/ret)


I am familiar with two basic strategies for structuring the function prologue/epilogue:

  1. "Regular" functions: Move the stack pointer to the end of the stack frame (sub rsp, n), do the actual work, then move the stack pointer back (add rsp, n) and ret. (If there are many registers used in the body, there may additionally be some pushing and popping here.)
  2. "Leaf" functions: Same as (1) but don't move the stack pointer, saving two instructions.

With strategy 2, you can't call functions inside the body, unless you move the stack pointer where it is supposed to be, which defeats the savings, which is why it's usually only used for leaf functions.

But it occurs to me that there is a third strategy one could use:

  1. "Stackless" functions: Use mov rsi, AFTER; jump FUNCTION; AFTER: for the call sequence, and in the function just jump rsi at the end.

In this method, we ignore the stack pointer completely, so we have no stack space, but for a small function that might be doable. It also requires a custom calling convention, but compilers can do that if they want to for internal functions.

Since it pairs jump with jump, it doesn't touch the return stack so the branch predictor should not be thrown off (although the indirect jump at the end might be slower than a return), and there is no overhead for the memory writes incurred by call. Additionally, stackless functions can call other stackless functions (although not too much since you eventually run out of registers in which to store the return addresses, and there is a global optimization problem in ensuring that if A calls B then they use different return registers).

My question is: why don't compilers use method (3) more? AFAICT it doesn't ever show up when looking at functions compiled by gcc or clang. Is there a reason this calling convention is not usable?


Solution

  • Here's an attempt to benchmark both options.

        .text
        .align 8
        
    subroutine:
        inc %rdx
    #ifdef REG_CALL
        jmp *%rsi
    #else
        ret
    #endif
    
        reps = 1000000
        .global main
        
    main:
        push %rbp
        mov $reps, %ecx
        xor %edx, %edx
        .align 8
    top:
        .rept 1000
    #ifdef REG_CALL
        lea 0f(%rip), %rsi
        jmp subroutine
    0:  
    #else
        call subroutine
    #endif
        .endr
        dec %ecx
        jnz top
    
        lea format(%rip), %rdi
        mov %rdx, %rsi
        xor %eax, %eax
        call printf
        xor %eax, %eax
        pop %rbp
        ret
    
        .data
    format: .asciz "%ld calls done\n"
    

    It does 1000 calls to the subroutine from varying return addresses, repeated one million times. Assemble with no options for traditional call/ret, with -DREG_CALL for your indirect jump proposal.

    On an i7-8565U CPU @ 1.80GHz, traditional takes 1.6 seconds and REG_CALL takes about 3.2. So your proposal seems to be about twice as slow.

    As I mentioned in comments, I suspect the indirect branch predictor can't keep track of where the jmp *%rsi is going to go.

    Aside from runtime inefficiencies, Raymond Chen mentions another major disadvantage of this strategy in the comments:

    Another problem is register assignment if one stackless function calls another. Not only must they use separate return address registers, their other register usages cannot conflict either. This would be practical only for small functions with few callers, at which point you may as well just inline them.