Fastest (CPU-wise) way to do functions in Intel x64 assembly?

I've been reading about assembly functions and I'm confused as to whether to use the enter and leave instructions or just the call/ret instructions for fast execution. Is one way faster and the other smaller? For example, what is the fastest (stdcall) way to do this in assembly without inlining the function:

#include <stdint.h>

typedef int32_t Int32;

static Int32 Add(Int32 a, Int32 b) {
    return a + b;
}

int main() {
    Int32 i = Add(1, 3);
}

Solution

Use call / ret, without making a stack frame with either enter / leave or push rbp / mov rbp, rsp / pop rbp. gcc (with -fomit-frame-pointer, which is the default when optimizing) only makes a stack frame in functions that do variable-size allocation on the stack. Skipping the frame pointer in hand-written asm may make debugging slightly more difficult: gcc normally emits stack unwind info when compiling with -fomit-frame-pointer, but your hand-written asm won't have that. Normally it only makes sense to write leaf functions in asm, or at least ones that don't call many other functions.
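To illustrate the variable-size allocation case, here is a minimal C sketch. Both function names are hypothetical and not from the question, and the statement about the frame pointer is an assumption about typical gcc/clang behaviour on x86-64, not a guarantee; compile with -O2 -S and look at the output to check what your compiler actually does.

```c
#include <stdint.h>

/* Hypothetical example: the array size is only known at run time, so the
 * distance from rsp to the function's other stack slots is not a
 * compile-time constant. This is the kind of function where a compiler
 * typically keeps a frame pointer (assumption: gcc/clang, x86-64). */
int32_t sum_of_squares(int32_t n)
{
    int32_t squares[n > 0 ? n : 1];   /* C99 variable-length array */
    int32_t total = 0;

    for (int32_t i = 0; i < n; i++)
        squares[i] = i * i;
    for (int32_t i = 0; i < n; i++)
        total += squares[i];
    return total;
}

/* Hypothetical counterpart with no stack allocation at all: the args
 * arrive in registers, so there is nothing for a frame pointer to track. */
int32_t add_three(int32_t a, int32_t b, int32_t c)
{
    return a + b + c;
}
```

An optimizing compiler is free to remove the array entirely if it can prove it isn't needed, so treat the asm you get as the ground truth rather than this description.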

Stack frames mean you don't have to keep track of how much the stack pointer has changed since function entry to access stuff on the stack (e.g. function args and spill slots for locals). Both the Windows and Linux/Unix 64-bit ABIs pass the first few args in registers, and there are often enough regs that you don't have to spill any variables to the stack, so a stack frame is a waste of instructions in most cases. In 32-bit code, having ebp available (going from 6 to 7 GP regs, not counting the stack pointer) makes a bigger difference than going from 14 to 15 does in 64-bit code. If you do use rbp as a general-purpose register, you still have to push/pop it, because in both ABIs it's a callee-saved register that functions aren't allowed to clobber.

If you're optimizing x86-64 asm, you should read Agner Fog's guides, and check out some of the other links in the x86 tag wiki.

The best implementation of your function is probably:

align 16
global Add
Add:
    lea   eax, [rdi + rsi]
    ret
    ; the high 32 of either reg doesn't affect the low32 of the result
    ; so we don't need to zero-extend or use a 32bit address-size prefix
    ; like  lea  eax, [edi + esi]
    ; even if we're called with non-zeroed upper32 in rdi/rsi.

align 16
global main
main:
    mov    edi, 1   ; 1st arg in SysV ABI
    mov    esi, 3   ; 2nd arg in SysV ABI
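    call Add        ; Add is a leaf that never touches the stack, so we skip the 16-byte rsp alignment the ABI requires before a call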
    ; return value in eax in all ABIs
    ret

align 16
OPmain:  ; What a compiler makes of the OP's main, which never uses or returns the result of Add
    xor   eax, eax
    ret

The lea / ret pair is in fact what gcc emits for Add(), but gcc still turns main into an empty function, or into a return 4 if you return i. clang 3.7 respects -fno-inline-functions even when the result is a compile-time constant. It beats my asm by doing tail-call optimization: when main returns the result, it ends with jmp Add instead of call Add / ret.
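If you want to reproduce this comparison yourself, a minimal C sketch like the following can be compiled with -O2 -S to inspect the generated asm. The noinline attribute is a GCC/clang extension, and call_add is a hypothetical wrapper standing in for a main that returns the result; neither is from the original question. Note that the attribute only blocks inlining: a compiler may still fold the call to a constant, which matches the gcc behaviour described above.

```c
#include <stdint.h>

typedef int32_t Int32;

/* GCC/clang extension: ask the compiler not to inline Add, so it stays
 * a real function that has to be reached with call (or jmp). */
__attribute__((noinline))
static Int32 Add(Int32 a, Int32 b)
{
    return a + b;
}

/* Hypothetical wrapper. Returning the call's result directly, with no
 * work left to do afterwards, is what makes a tail call (jmp Add)
 * possible in place of call Add / ret. */
Int32 call_add(void)
{
    return Add(1, 3);
}
```

Whether you see jmp Add, call Add, or just mov eax, 4 depends on the compiler and version, so the output is worth checking rather than assuming.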

Note that the Windows 64-bit ABI uses different registers for function args: the first four integer args go in rcx, rdx, r8, r9, where the SysV ABI uses rdi, rsi, rdx, rcx, r8, r9. See the links in the x86 tag wiki, or Agner Fog's ABI guide. Assembler macros may help for writing functions in asm that use the correct registers for their args, depending on the platform you're targeting.