From Intel's introduction to x64 assembly at https://software.intel.com/en-us/articles/introduction-to-x64-assembly,
While I understand how RCX, RDX, R8, and R9 are used as function arguments, I've seen functions that take more than 4 arguments fall back to using the stack, like 32-bit code. An example is below:
sub_18000BF10 proc near
lpDirectory = qword ptr -638h
nShowCmd = dword ptr -630h
Parameters = word ptr -628h
sub rsp, 658h
mov r9, rcx
mov r8, rdx
lea rdx, someCommand ; "echo "Hello""...
lea rcx, [rsp+658h+Parameters] ; LPWSTR
xor r11d, r11d
lea r9, [rsp+658h+Parameters] ; lpParameters
mov [rsp+658h+nShowCmd], r11d ; nShowCmd
lea r8, aCmdExe ; "cmd.exe"
lea rdx, Operation ; "open"
xor ecx, ecx ; hwnd
mov [rsp+658h+lpDirectory], r11 ; lpDirectory
mov eax, 1
add rsp, 658h
This is an excerpt from IDA, and you can see that the nShowCmd and lpDirectory arguments to ShellExecute are on the stack. Why can't we use the extra registers after R9 for fast-call behavior?
Or, if we can do that in user-defined functions but the system API functions don't, is there a reason for it? I imagine fast-call arguments in registers would be more efficient than computing stack offsets and going through memory.
The Windows x64 calling convention is designed to make it easy to implement variadic functions (like printf and scanf) by dumping the 4 register args into the shadow space, creating a contiguous array of all args. Args larger than 8 bytes are passed by reference, so each arg always takes exactly 1 arg-passing slot.
Given this design constraint, more register args would require a larger shadow space, which wastes more stack space for small functions that don't have a lot of args.
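To make the variadic case concrete, here is a minimal sketch in portable C. The name `sum_args` is hypothetical and not from the question. On Windows x64, once the callee has spilled RCX, RDX, R8, and R9 into the shadow space, `va_arg` can be little more than a pointer increment over one contiguous array of 8-byte slots. On x86-64 System V, `va_list` has to track a register save area and the stack args separately.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example: sum `count` long long arguments.
 * Every variadic argument must really be a long long at the call site. */
long long sum_args(int count, ...)
{
    assert(count >= 0);

    va_list ap;
    va_start(ap, count);

    long long total = 0;
    for (int i = 0; i < count; i++) {
        /* Fetch the next argument slot, whether it arrived in a
         * register or on the stack. */
        total += va_arg(ap, long long);
    }

    va_end(ap);
    return total;
}
```

A call such as `sum_args(6, 1LL, 2LL, 3LL, 4LL, 5LL, 6LL)` has 7 args in total, so under the Windows x64 convention the last 3 go on the stack, directly above the 32 bytes of shadow space reserved for the first 4.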
Yes, more register args would normally be more efficient. But if the callee wants to make another function call right away with different args, it would then have to store all its register args to the stack, or copy them to call-preserved registers that it first saved. So register args have downsides, and those get worse as you have more of them.
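As a rough illustration of that cost, consider this small C sketch (both function names are hypothetical). Assuming the compiler does not inline `scale`, the values of `b`, `c`, and `d` arrive in arg-passing registers, which are call-clobbered, so `combine` has to move them somewhere safe before the call and get them back afterwards.

```c
#include <assert.h>

/* Hypothetical callee; stands in for any non-inlined function. */
long scale(long x)
{
    return x * 10;
}

/* b, c and d are still needed after the call to scale(), so they
 * cannot simply stay in the registers they were passed in. */
long combine(long a, long b, long c, long d)
{
    long r = scale(a);      /* arg-passing registers may be clobbered here */
    return r + b + c + d;   /* needs the saved copies of b, c and d */
}
```

With 4 register args that is up to 3 values to save around the call; with more arg-passing registers, the worst case grows accordingly.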
You want a good mix of call-preserved and call-clobbered registers, regardless of how many are used for arg-passing. R10 and R11 are call-clobbered scratch regs. A transparent wrapper function written in asm might use them for scratch space without disturbing any of the args in RCX,RDX,R8,R9, and without needing to save/restore a call-preserved register anywhere.
R12..R15 are call-preserved registers you can use for whatever you want, as long as you restore them before returning. Same as RSI, RDI, RBX, and RBP.
"Or if we can do that in user-defined functions"
Yes, you can freely make up your own calling conventions when calling from asm to asm, subject to constraints imposed by the OS. But if you want exceptions to be able to unwind the stack through such a call (e.g. if one of the child functions calls back into some C++ that can throw), you have to follow more restrictions, such as creating unwind metadata. If not, you can do nearly anything.
See my answer "Choose your calling convention to put args where you want them" on the Code Golf Q&A "Tips for golfing in x86/x64 machine code".
You can also return in whatever register(s) you want, and return multiple values. (e.g. an asm memcmp function can return the -/0/+ difference at the mismatch in EAX, and return the mismatch position in RDI, so the caller can use either or both.)
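C cannot name return registers, but the same idea can be sketched with a small struct. This is only an approximation of the asm version described above: `struct mismatch` and `find_mismatch` are hypothetical names, and where the two fields end up is decided by the ABI rather than by you. x86-64 System V returns a 16-byte struct like this one in RAX:RDX, while Windows x64 returns it through a hidden pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: both pieces of information at once. */
struct mismatch {
    int diff;    /* -/0/+ difference of the first mismatching bytes */
    size_t pos;  /* index of the first mismatch, or n if none */
};

/* Compare n bytes and report both the difference and its position. */
struct mismatch find_mismatch(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;

    for (size_t i = 0; i < n; i++) {
        if (pa[i] != pb[i]) {
            struct mismatch m = { (int)pa[i] - (int)pb[i], i };
            return m;
        }
    }

    struct mismatch equal = { 0, n };
    return equal;
}
```

A caller that only wants memcmp-style ordering reads `diff`; one that wants the common-prefix length reads `pos`.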
By comparison, the x86-64 System V ABI passes the first 6 integer args in registers, and the first 8 FP args in XMM0..7. (Windows x64 passes the 5th arg on the stack, even if it's FP and the first 4 args were all integer.)
So the other major x86-64 calling convention does use more arg-passing registers. It doesn't use shadow-space; it defines a red-zone below RSP that's safe from being asynchronously clobbered. Small leaf functions can still avoid manipulating RSP to reserve space.
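The difference in how the two conventions assign registers can be modeled in a few lines of C. This is a simplified sketch, not an ABI implementation: it only handles scalar integer/pointer args (`'i'`) and scalar float/double args (`'f'`), and ignores structs, vectors, `long double`, and variadic calls. The function names and the signature-string encoding are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* sig is a string of 'i' (integer/pointer) and 'f' (float/double),
 * one character per argument; index selects the argument (0-based). */

/* Windows x64: the argument's position picks the register. Slot k uses
 * either the k-th integer register or XMMk, and slots 4+ go on the stack. */
const char *win64_arg_location(const char *sig, size_t index)
{
    static const char *const int_regs[] = { "RCX", "RDX", "R8", "R9" };
    static const char *const fp_regs[]  = { "XMM0", "XMM1", "XMM2", "XMM3" };

    assert(index < strlen(sig));
    if (index >= 4)
        return "stack";
    return sig[index] == 'f' ? fp_regs[index] : int_regs[index];
}

/* x86-64 System V: integer and FP args are counted separately,
 * with 6 integer registers and 8 XMM registers available. */
const char *sysv_arg_location(const char *sig, size_t index)
{
    static const char *const int_regs[] =
        { "RDI", "RSI", "RDX", "RCX", "R8", "R9" };
    static const char *const fp_regs[] =
        { "XMM0", "XMM1", "XMM2", "XMM3", "XMM4", "XMM5", "XMM6", "XMM7" };

    assert(index < strlen(sig));

    /* Count how many registers of each kind earlier args already used. */
    size_t ints = 0, fps = 0;
    for (size_t i = 0; i < index; i++) {
        if (sig[i] == 'f')
            fps++;
        else
            ints++;
    }

    if (sig[index] == 'f')
        return fps < 8 ? fp_regs[fps] : "stack";
    return ints < 6 ? int_regs[ints] : "stack";
}
```

For the signature `"iiiif"` (four integers, then a double), this model puts the 5th arg on the stack for Windows x64 and in XMM0 for System V, matching the parenthetical above.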
Fun fact: R10 and R11 are also non-arg-passing call-clobbered registers in x86-64 SysV. Fun fact #2: syscall destroys R11 (and RCX), so Linux uses R10 instead of RCX for passing arguments to system calls, but otherwise uses the same register-arg passing convention as user-space function calls.
See also Why does Windows64 use a different calling convention from all other OSes on x86-64? for more guesswork and info about why Microsoft made the design choices they did with their calling convention.
x86-64 System V makes it more complex to implement variadic functions (more code to index args), but they're generally rare. Most code doesn't bottleneck on sscanf throughput. Shadow space is usually worse than a red-zone. The original Windows x64 convention doesn't pass vector args (__m128) by value, so there's a 2nd 64-bit calling convention on Windows called vectorcall that allows efficient vector args. (Not usually a big deal because most functions that take vector args are inline, but SIMD math library functions would benefit.)
Having more args passed in the low 8 registers (the original RAX..RDI, which don't need a REX prefix), and having more call-clobbered registers that don't need a REX prefix, is probably good for code size in code that inlines enough to not make a huge number of function calls. You could say that Windows' choice of having more of the non-REX registers be call-preserved is better for code with loops containing function calls, but if you're making lots of function calls to short callees, then those callees would benefit from more call-clobbered scratch registers that don't need REX prefixes. I wonder how much thought MS put into this, or if they just mostly kept things similar to 32-bit calling conventions when choosing which of the low-8 registers would be call-preserved.
One of x86-64 System V's weaknesses is having no call-preserved XMM registers, though. So any function call requires spilling and reloading any FP variables that are live across it. Having a couple of call-preserved ones, like the low 128 or 64 bits of XMM6 and XMM7, would probably have been good.