I have a function that itself needs two arguments and forwards the rest to another function. I'm using the fastcall convention. I have tried this:
forward:
push r14 # r14 and r13 will store the arguments used by the function itself
push r13
mov r13, rcx
mov r14, rdx
mov rcx, r8
mov rdx, r9
mov r8, [rsp + 40 + 16] # 32 bytes of shadow space + 8 bytes of return address + 16 bytes because of registers pushed onto the stack
mov r9, [rsp + 48 + 16]
# ... some operations on arguments passed to this function
# Add to rsp, therefore making arguments that are supposed to be forwarded and were passed on the stack, accessible at the expected offset
add rsp, 40 # 16 - preserved registers,
# 8 - the current return address,
# 8 - return address pushed by call instruction,
# 8 - account for the stack frame of the called function? (I don't quite understand this one)
call foo
sub rsp, 40
pop r13
pop r14
ret
This works (kind of), but because I increased the stack pointer, the registers I wanted to preserve and the return address of the forward function could be overwritten (at least I think so), so this is not an acceptable solution. I'm fairly new to assembly, so maybe my approach is invalid, and I'm not sure I understand all of this correctly, so feel free to correct me if I'm wrong. Is there any other way to "move" the arguments passed on the stack to the offset expected by the callee (or a fix that will not cause an access violation)? One thing to note is that I cannot use a stack frame.
Do you have to use a variadic function? Could you have the caller pass a pointer to a variable-length union, or a byte buffer that the callee can decode however it wants? Or maybe a va_list like vprintf takes.
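The va_list idea could look roughly like this in C. This is only a sketch: `forward`, `vfoo`, and their parameters are made-up names, and it assumes the callee can be changed to take a va_list the way vprintf does. The forwarder uses its own two arguments and hands the rest on without knowing how many there are.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical callee: receives the forwarded arguments as a va_list,
   the same way vprintf/vsnprintf do. */
static int vfoo(char *out, size_t size, const char *fmt, va_list ap)
{
    return vsnprintf(out, size, fmt, ap);
}

/* Hypothetical forwarder: `tag` and `level` are the two arguments it
   needs itself; everything after `fmt` is passed through to vfoo. */
static int forward(const char *tag, int level, char *out, size_t size,
                   const char *fmt, ...)
{
    va_list ap;
    int prefix, rest;

    /* Use the two private arguments. */
    prefix = snprintf(out, size, "[%s:%d] ", tag, level);
    if (prefix < 0 || (size_t)prefix >= size)
        return -1;

    /* Forward the remaining arguments without counting or copying them. */
    va_start(ap, fmt);
    rest = vfoo(out + prefix, size - (size_t)prefix, fmt, ap);
    va_end(ap);

    return rest < 0 ? -1 : prefix + rest;
}
```

The compiler does the stack bookkeeping here, so nothing has to be moved by hand; the cost is that the callee's signature has to change.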
Is your caller also written in assembly? If so, it could pass the extra 2 args in two different registers, like RAX and R10, leaving everything else ready for a jmp foo tailcall to the other function.
add rsp, 40 is fatally flawed. You just pointed RSP above the data you just pushed, and above your own return address. Then you did call foo, which uses an unbounded amount of stack space below RSP, overwriting all of that stuff which you're later going to try to pop.
incoming RSP+40: first incoming stack arg
incoming RSP+32: top qword of shadow space (for your function, allocated by your caller)
incoming RSP+24: shadow_space[2] <--- RSP points here after add rsp, 40
incoming RSP+16: shadow_space[1] <--- call foo writes its return address here
incoming RSP+8: low qword of shadow space
incoming RSP+0: your return address
incoming RSP-8: saved r14
incoming RSP-16: saved r13 <--- RSP points here after push/push
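After add rsp, 40, call foo pushes a return address into stack space you own (your shadow space). And yes, foo's 32 bytes of shadow space sit directly above that return address, and right above those is your 3rd incoming stack arg, which becomes foo's 1st stack arg after shifting args by 2.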
But the function only has 1 qword (8 bytes) of usable stack space (other than the return address) before it starts overwriting the stuff you're planning to pop later, including your own return address.
This could almost work if your callee is also written in assembly and either doesn't use more than 1 push worth of stack space, or if it starts with sub rsp, 40 to realign the stack and move it below the stuff it needs to not overwrite.
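To double-check the offsets in the diagram, here is the same arithmetic written out as C constants. It's just a sketch: the names are made up, and every value is a byte offset relative to RSP on entry to forward ("incoming RSP" above).

```c
#include <stdint.h>

/* Byte offsets relative to RSP on entry to `forward`. Names are hypothetical. */
enum {
    OWN_RET_SLOT        = 0,                      /* forward's own return address */
    AFTER_PUSHES        = -16,                    /* after push r14 / push r13 */
    AFTER_ADD           = AFTER_PUSHES + 40,      /* after add rsp, 40 */
    FOO_RET_SLOT        = AFTER_ADD - 8,          /* where call foo stores its return address */
    FOO_FIRST_STACK_ARG = FOO_RET_SLOT + 8 + 32,  /* skip return address + foo's shadow space */
    FOO_SAFE_BYTES      = FOO_RET_SLOT - (OWN_RET_SLOT + 8), /* room before forward's return address */
    AFTER_FOO_SUB       = FOO_RET_SLOT - 40       /* if foo begins with sub rsp, 40 */
};

/* On entry to any function RSP is 8 mod 16 (the call pushed 8 bytes onto
   an aligned stack). Returns 1 if incoming_rsp + offset is 16-byte aligned. */
static int is_aligned_16(uint64_t incoming_rsp, int offset)
{
    return ((incoming_rsp + (uint64_t)(int64_t)offset) & 15u) == 0;
}
```

FOO_SAFE_BYTES comes out to a single qword, and AFTER_FOO_SUB lands below the saved r13 at an aligned address, which is why the sub rsp, 40 variant happens to line up.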
But if that's the case, callee written in asm, you can just use a custom calling convention where it knows where to look for args. Or even pass it a pointer to the start of stack args, e.g. in RAX or R10 or something.
I said "almost work" because Windows x64 doesn't have a red zone: you have to assume anything below RSP gets wiped out asynchronously by an SEH exception, or by a debugger evaluating an expression like print foo(123) that runs code in your process using the current stack. See "Is it valid to write below ESP?" (No, apparently not ever: you can't just avoid installing exception handlers the way you could avoid signal handlers on Linux. And that answer should AFAIK apply to x64 processes as well.)
It would work to copy all your stack args 16 bytes lower on the stack (and shuffle registers like you're doing) before a jmp foo tailcall. But it's not ok to memmove data in the parent function's stack frame, so you'd need to know how many bytes of args to copy.
Windows x64 has the interesting property that every arg takes exactly one 8-byte stack slot (wider args are passed by pointer). But you still have no way to know how many args you received (unless one of those first 2 args tells you). At least this lets you just pass an arg count instead of an arg size in bytes.
You aren't handling XMM args, but you don't need to because Windows x64 requires variadic XMM args to be duplicated to the corresponding integer registers. And in practice variadic functions just dump the integer args into shadow space to form an array of args on the stack, not using the XMM args. So it's fine if they don't match. (x86-64 System V is much harder: the first 8 FP args go in XMM regs, even if they're not one of the first 8 args overall. Nice for non-variadic functions, though.)
For the memmove, a loop of 16-byte movaps load/store instructions is probably your best bet. Or rep movsq could work with RSI and RDI offset by 16. The output area overlaps the input area, but with RDI at a lower address than RSI (and DF=0 so going upwards), the result is the same as copying a qword at a time. So hopefully fast-strings microcode can still work to copy 32 bytes at a time (after the significant startup overhead, though). rep movsb would also work and also be fast on CPUs with the ERMSB feature, although I worry that the near overlap could make microcode choose 1-at-a-time on some CPU, in which case rep movsq would be 8x faster.
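To see why the overlapping copy is safe in that direction, here is a small C model of it. It's a sketch under assumptions: the stack is modelled as an array of 8-byte slots indexed from the incoming RSP (slot 0 is the return address, slots 1 to 4 are shadow space, slot 5 is the first stack arg), the number of stack args is known, and the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` qwords in ascending address order, like rep movsq with
   DF=0. With overlapping regions this is only safe when dst < src. */
static void copy_qwords_ascending(uint64_t *dst, const uint64_t *src,
                                  size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: move `nargs` stack args 2 slots (16 bytes) lower.
   slots[0] models the qword at the incoming RSP. */
static void shift_stack_args(uint64_t *slots, size_t nargs)
{
    enum { FIRST_STACK_ARG = 5, SHIFT = 2 };

    copy_qwords_ascending(slots + FIRST_STACK_ARG - SHIFT,
                          slots + FIRST_STACK_ARG, nargs);
}
```

Each slot is read before the copy reaches it as a destination, so the result matches what memmove would produce. The first two stack args also land in the shadow space, which is harmless since they are loaded into R8 and R9 anyway.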
Semi-related: How to set function arguments in assembly during runtime in a 64bit application on Windows? - preparing a flat buffer of args at run-time and using that for Windows x64.