assembly x86-64 cpu-registers calling-convention abi

Acceptability of regular usage of r10 and r11

I have recently been doing a lot of x64 assembly programming (on Linux) for integration with my C/C++ programs.

Since I am mostly concerned about efficiency, I like to use as few different registers / memory addresses as possible, and I try not to create stack frames or preserve registers (every cycle counts).

According to the cdecl calling convention, the r10 and r11 registers are not preserved, and I wish to use them as temporaries in my functions, preferably without preserving them. Does this cause any incompatibility issues or bugs with any compiler? (I haven't experienced any so far, but it is a concern.)

Solution

Both r10 and r11 are call-clobbered (aka volatile) registers: you can overwrite them without saving/restoring in any leaf or non-leaf function. That's what C compilers do, and what they expect from the functions their code calls, because that's what the ABI doc says. See "What registers are preserved through a linux x86-64 function call".

Your caller will expect them to hold garbage after return. Just like arg-passing registers such as RDI or RCX. (And RDX if it's not part of a wide RDX:RAX return value.)
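As a minimal sketch of that rule, here is a leaf function written in assembly that uses r10 and r11 as scratch without saving them. The name mix35 is made up for illustration, and the function is embedded as GNU C top-level asm (Intel syntax, switched back to AT&T afterwards) only so it can live in one C file; a separate .s/.asm file would work the same way. The plain-C fallback is there just so the file still builds on non-x86-64 targets.

```c
#include <stdint.h>

/* Hypothetical leaf function: returns a*3 + b*5. */
uint64_t mix35(uint64_t a, uint64_t b);

#if defined(__x86_64__) && defined(__linux__)
__asm__(
    ".pushsection .text\n"
    ".intel_syntax noprefix\n"
    ".globl mix35\n"
    ".type mix35, @function\n"
    "mix35:\n"
    "    lea r10, [rdi + rdi*2]\n"  /* r10 = a*3: no push needed, r10 is call-clobbered */
    "    lea r11, [rsi + rsi*4]\n"  /* r11 = b*5: same for r11 */
    "    lea rax, [r10 + r11]\n"    /* return value goes in RAX */
    "    ret\n"                     /* r10/r11 hold garbage from the caller's point of view */
    ".size mix35, .-mix35\n"
    ".att_syntax prefix\n"
    ".popsection\n"
);
#else
/* Fallback so the sketch still builds on other targets. */
uint64_t mix35(uint64_t a, uint64_t b) { return a * 3 + b * 5; }
#endif
```

A C or C++ caller just declares the prototype and calls mix35(a, b); the compiler already assumes r10 and r11 are destroyed by any call it makes.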

The x86-64 System V ABI doesn't name its calling convention "cdecl". It's just the x86-64 SysV calling convention. The string "cdecl" doesn't appear in the ABI doc.

r11 is a temporary, aka call-clobbered register. It is the only general-purpose register that is both call-clobbered and never used for passing or returning anything, so even wrapper / trampoline / hook functions can safely clobber it while forwarding all args and returning all return values. Lazy dynamic linker code is one example.
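A forwarding hook along those lines might look like the sketch below. All names (hooked, hook_target, hook_calls, add2) are hypothetical, and it assumes a normal non-shared (PIE or non-PIE) executable so the RIP-relative references resolve at link time. The wrapper counts calls and tail-jumps to the real function through r11, leaving every arg-passing register untouched.

```c
long hook_calls = 0;                      /* hypothetical call counter */
long add2(long a, long b) { return a + b; }
long (*hook_target)(long, long) = add2;   /* hypothetical forwarding target */

/* Hypothetical hook: bumps hook_calls, then forwards to *hook_target. */
long hooked(long a, long b);

#if defined(__x86_64__) && defined(__linux__)
__asm__(
    ".pushsection .text\n"
    ".intel_syntax noprefix\n"
    ".globl hooked\n"
    ".type hooked, @function\n"
    "hooked:\n"
    "    mov r11, qword ptr [rip + hook_target]\n"  /* r11 carries no arg, so it's free */
    "    add qword ptr [rip + hook_calls], 1\n"     /* side effect of the hook */
    "    jmp r11\n"                                 /* tail call: RDI, RSI, ... still intact */
    ".size hooked, .-hooked\n"
    ".att_syntax prefix\n"
    ".popsection\n"
);
#else
/* Fallback so the sketch still builds on other targets. */
long hooked(long a, long b) { hook_calls++; return hook_target(a, b); }
#endif
```

Because the target returns directly to the original caller, the return value in RAX (and RDX) is forwarded for free as well.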

r10 is also call-clobbered. The ABI says "used for passing a function’s static chain pointer". In languages that use nested functions, this is an extra incoming arg to such functions so they can find the local vars of the outer scope. A pointer-to-nested-function needs a code pointer and a static chain pointer for the caller to pass when dereferencing.

It's a "chain" because there can be multiple levels of nesting with their stack frames forming a linked list, and "static" because it's based on lexical scope, not the call stack. (Thanks @Raymond Chen for explaining the terminology.)

But like all arg-passing registers to function calls (not system calls) in x86-64 System V, it's call-clobbered (like in most calling conventions generally). You only need to worry about this usage of r10 if you're hooking or wrapping nested functions, ones defined inside another function. If you're just writing a function that's called normally, it's a pure temporary.

GCC does use r10 as part of its trampoline for function pointers to GNU C nested functions, for a pointer to the stack frame of the outer scope. The trampoline of machine code on the stack is a hack, but this is indeed a static chain pointer; languages with proper support for nested functions (unlike C and C++) would probably have the caller aware of it (like a lambda / closure) and passing a value in r10 when calling through a pointer to a nested function.

In a normal function, RBX, RBP, and RSP are call-preserved, along with R12..R15. All others can be clobbered without saving/restoring. (That includes xmm/ymm0..15 and zmm0..31 / k0..7, and the mmx/x87 stack, and the condition codes in RFLAGS).
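The flip side of "call-clobbered" is that r10 and r11 do not survive calls your own function makes. A sketch of a non-leaf function, with the hypothetical names add_twice and twice, that keeps a value across a call in call-preserved RBX instead:

```c
long twice(long x) { return 2 * x; }   /* hypothetical callee */

/* Hypothetical non-leaf function: returns a + twice(b). */
long add_twice(long a, long b);

#if defined(__x86_64__) && defined(__linux__)
__asm__(
    ".pushsection .text\n"
    ".intel_syntax noprefix\n"
    ".globl add_twice\n"
    ".type add_twice, @function\n"
    "add_twice:\n"
    "    push rbx\n"        /* save our caller's RBX; also realigns RSP to 16 for the call */
    "    mov rbx, rdi\n"    /* a must survive the call: r10/r11 could be overwritten */
    "    mov rdi, rsi\n"    /* first arg for twice() = b */
    "    call twice\n"
    "    add rax, rbx\n"    /* rax = twice(b) + a */
    "    pop rbx\n"         /* restore RBX before returning */
    "    ret\n"
    ".size add_twice, .-add_twice\n"
    ".att_syntax prefix\n"
    ".popsection\n"
);
#else
/* Fallback so the sketch still builds on other targets. */
long add_twice(long a, long b) { return a + twice(b); }
#endif
```

In a leaf function the push/pop pair would be pure overhead, which is why r10 and r11 (or any other call-clobbered register) are the better choice there.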

Note that r8..r15 need a REX prefix, even with 32-bit operand-size (like xor r10d, r10d). If you have some 64-bit non-pointer integers, then sure, keep them in r8..r11, because you always need a REX prefix for 64-bit operand-size any time you use those values anyway.

Smaller code-size is usually not worse, and sometimes helps with decode and uop-cache density, and L1i cache density. RAX, RCX, RDX, RSI, and RDI should be your first choices for scratch regs. (And use 32-bit operand-size unless you need 64-bit. e.g. xor eax,eax is the correct way to zero RAX. Silvermont doesn't recognize xor r10,r10 as a zeroing idiom, so use xor r10d,r10d even though it doesn't save code size.)

If you do run out of low registers, ideally use r8..r11 for things that will normally be used with 64-bit operand-size (or VEX prefixes) anyway. e.g. pointers to 64-bit data or pointers to pointers. mov eax, [r10] needs a REX prefix while mov eax, [rdi] doesn't. But mov rax, [rdi] and mov r8, [r10] are the same size.
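To make the size comparison concrete, the table below lists one valid encoding of each instruction mentioned above as raw bytes. The array names and the has_rex helper are made up for illustration; the bytes are what an assembler would normally emit, but check your own assembler's listing or a disassembler rather than relying on this sketch.

```c
/* mov loads: opcode 8B /r, ModRM selects the registers. */
const unsigned char mov_eax_mrdi[] = {0x8B, 0x07};        /* mov eax, [rdi]  (no prefix) */
const unsigned char mov_eax_mr10[] = {0x41, 0x8B, 0x02};  /* mov eax, [r10]  (REX.B)     */
const unsigned char mov_rax_mrdi[] = {0x48, 0x8B, 0x07};  /* mov rax, [rdi]  (REX.W)     */
const unsigned char mov_r8_mr10[]  = {0x4D, 0x8B, 0x02};  /* mov r8,  [r10]  (REX.W+R+B) */

/* Zeroing idioms: opcode 31 /r. */
const unsigned char xor_eax_eax[]   = {0x31, 0xC0};        /* xor eax, eax   (no prefix) */
const unsigned char xor_r10d_r10d[] = {0x45, 0x31, 0xD2};  /* xor r10d, r10d (REX.R+B)   */
const unsigned char xor_r10_r10[]   = {0x4D, 0x31, 0xD2};  /* xor r10, r10   (REX.W+R+B) */

/* In 64-bit mode, a first byte in 0x40..0x4F is a REX prefix.
   (Only valid for these examples, which have no other prefixes.) */
int has_rex(const unsigned char *insn) { return (insn[0] & 0xF0) == 0x40; }
```

Only the first row of each group avoids the prefix byte: once either the operand-size is 64-bit or any operand is r8..r15, the instruction costs the same extra byte.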

It's hard to gain much because you often need to use different values together in different combinations, like eventually using cmp eax, r10d or whatever, but if you want to go all-out on optimizing, then think about code-size.

Maybe also think about where the instruction boundaries are and how it will fit into the uop cache. (And the JCC erratum for Skylake.)

See the x86 tag wiki, and especially http://agner.org/optimize/ for tips on writing efficient code.