I'm working with an assembler-like API (it's not really an assembler, but it can emit machine code) that I am debugging and toying with. It's specifically for the System V x86-64 ABI, so I'm only going to be talking about SysV calling conventions and such.
For some reason, when I emit some contrived code like this, for testing purposes
builder.emit_sub(rsp, 1);
builder.emit_movq_vr(reinterpret_cast<uint64_t>(&hello_world), rax);
builder.emit_call(rax);
builder.emit_add(rsp, 1);
builder.emit_ret();
a segmentation fault occurs at the call (when it is run, not when being assembled), and yet
builder.emit_movq_vr(reinterpret_cast<uint64_t>(&hello_world), rax);
builder.emit_jmp(rax);
succeeds just fine. The point of failure seems to be at the call instruction, but I don't know what is bugging out the pseudo-assembler. It might be emitting the wrong opcode or operands, but I'm not sure. The raw emitted machine code for the buggy sequence looks like this, with each line prefixed by the mnemonic it is supposed to represent, as printed by some simple debug statements:
sub 48 81 EC 01 00 00 00
movqvr 48 B8 63 80 AA 01 01 00 00 00
call FF D0
add 48 81 C4 01 00 00 00
ret C3
Remark: movqvr is not a real instruction mnemonic; the vr at the end is just a debug annotation to me saying it's a "move imm64 to reg" kind of instruction.
Remark: The sub and add are to align the stack on a 16-byte boundary, which I believe is a necessity in this ABI. They could've been better written as a push rax and a pop rax (or pop rcx if rax is needed for a return value), but ignore that, unless it is this that is messing up the call (e.g. if rsp is not being modified correctly).
Yes, in the System V ABI, the stack is aligned to a 16-byte boundary before every call instruction. The call itself pushes an 8-byte return address, so on function entry it takes another 8 bytes (not 1) to reach the next 16-byte boundary. Remember that in C, pointer arithmetic is scaled by sizeof(type), but in asm it isn't: sub rsp, 1 moves rsp by a single byte.
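To make the arithmetic concrete, here is a minimal sketch that just models rsp as an integer. The function name is made up for illustration, and any entry value that is 8 mod 16 behaves the same way.

```cpp
#include <cassert>
#include <cstdint>

// Returns rsp modulo 16 after a prologue subtracts `reserve` bytes.
// `rsp_at_entry` is the value of rsp at the first instruction of a function.
// (Hypothetical helper for illustration; it only does integer arithmetic.)
inline std::uint64_t misalignment_after_reserve(std::uint64_t rsp_at_entry,
                                                std::uint64_t reserve) {
    // SysV guarantee: rsp was 16-byte aligned before the call, and the call
    // pushed an 8-byte return address, so rsp is 8 mod 16 on entry.
    assert(rsp_at_entry % 16 == 8);
    return (rsp_at_entry - reserve) % 16;
}
```

With an entry value such as 0x7ffc00001008, reserving 1 byte leaves rsp at 7 mod 16, while reserving 8 or 24 bytes leaves it at 0 mod 16, which is what the next call needs.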
And yes, push rax / pop rcx would be a good choice, and is what clang / LLVM does if it doesn't already need to push an odd number of call-preserved registers or reserve any extra stack space. If you do need to reserve any stack space for locals, use an offset that will leave rsp 16-byte aligned.
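As a rough sketch of what the push / pop version would look like as raw bytes: the helper names below are hypothetical and are not part of your builder API; the encodings are the standard ones (50 = push rax, 48 B8 = mov rax, imm64, FF D0 = call rax, 59 = pop rcx, C3 = ret).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends the 8 bytes of `value` in little-endian order.
static void append_u64_le(std::vector<std::uint8_t>& out, std::uint64_t value) {
    for (int i = 0; i < 8; ++i) {
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
    }
}

// Hypothetical helper: builds the bytes for
//   push rax; mov rax, target; call rax; pop rcx; ret
std::vector<std::uint8_t> emit_aligned_call(std::uint64_t target) {
    std::vector<std::uint8_t> code;
    code.push_back(0x50);          // push rax: rsp -= 8, now 16-byte aligned
    code.push_back(0x48);          // REX.W prefix
    code.push_back(0xB8);          // mov rax, imm64
    append_u64_le(code, target);   // the 64-bit absolute address
    code.push_back(0xFF);          // call r/m64 ...
    code.push_back(0xD0);          // ... with ModRM selecting rax
    code.push_back(0x59);          // pop rcx: rsp += 8, leaves rax (return value) intact
    code.push_back(0xC3);          // ret
    assert(code.size() == 15);
    return code;
}
```

This only produces the byte sequence; copying it into executable memory and running it is left out on purpose.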
BTW, you could save code size by using the sub r/m64, imm8 encoding when the immediate fits in a sign-extended 8-bit value (i.e. if ((int8_t)imm == imm)). Also, if you ever need to subtract exactly 128, note that +128 doesn't fit in an imm8 but -128 does, so you can use add rsp, -128 instead of sub rsp, 128 (e.g. after an odd number of push instructions).
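Here is one way the encoder could pick the short form, as a sketch only. The function name is an assumption, not something from your API; the byte patterns are 48 83 EC ib (sub rsp, imm8), 48 83 C4 ib (add rsp, imm8) and 48 81 EC id (sub rsp, imm32).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes "reserve `bytes` bytes of stack" as compactly
// as possible.
std::vector<std::uint8_t> encode_reserve_stack(std::int32_t bytes) {
    assert(bytes > 0);
    std::vector<std::uint8_t> code;
    code.push_back(0x48);                  // REX.W prefix
    if (bytes == 128) {
        // +128 does not fit in a sign-extended imm8, but -128 does:
        code.push_back(0x83);              // add r/m64, imm8
        code.push_back(0xC4);              // ModRM: mod=11, /0, rm=rsp
        code.push_back(0x80);              // imm8 = -128
    } else if (bytes <= 127) {
        // Same condition as (int8_t)imm == imm for a positive imm.
        code.push_back(0x83);              // sub r/m64, imm8
        code.push_back(0xEC);              // ModRM: mod=11, /5, rm=rsp
        code.push_back(static_cast<std::uint8_t>(bytes));
    } else {
        code.push_back(0x81);              // sub r/m64, imm32
        code.push_back(0xEC);
        const std::uint32_t u = static_cast<std::uint32_t>(bytes);
        for (int i = 0; i < 4; ++i) {      // little-endian imm32
            code.push_back(static_cast<std::uint8_t>(u >> (8 * i)));
        }
    }
    return code;
}
```

For example, reserving 8 bytes comes out as 4 bytes of machine code instead of the 7 your current sub uses.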
If you know the address where your code will run from, and the target is within ±2 GiB of it, you should use the call rel32 encoding rather than a register-indirect call. But you're right that reaching an arbitrary 64-bit address requires this mov r64, imm64 sequence plus an indirect call or jmp, not a direct call.
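A sketch of how the rel32 form could be computed, assuming you know the address the call instruction itself will end up at. The function name and signature are made up for illustration. The displacement is relative to the end of the 5-byte instruction (E8 followed by a little-endian rel32).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: tries to encode `call target` as E8 rel32, given the
// address where the call instruction will be placed. Returns false if the
// target is out of rel32 range, in which case the mov r64, imm64 plus
// indirect call fallback is needed.
bool encode_call_rel32(std::uint64_t call_addr, std::uint64_t target,
                       std::array<std::uint8_t, 5>& out) {
    assert(call_addr <= UINT64_MAX - 5);   // the instruction must not wrap around
    // rel32 is measured from the address of the NEXT instruction.
    const std::int64_t disp = static_cast<std::int64_t>(target - (call_addr + 5));
    if (disp < INT32_MIN || disp > INT32_MAX) {
        return false;
    }
    const std::uint32_t u = static_cast<std::uint32_t>(static_cast<std::int32_t>(disp));
    out[0] = 0xE8;                          // call rel32
    for (int i = 0; i < 4; ++i) {           // little-endian displacement
        out[i + 1] = static_cast<std::uint8_t>(u >> (8 * i));
    }
    return true;
}
```

A call placed at 0x1000 targeting 0x2000 would encode as E8 FB 0F 00 00, since 0x2000 - 0x1005 = 0xFFB.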
Did you use a debugger to find out where hello_world crashed? Maybe if it calls printf (rather than puts), it forgot to zero al (with xor eax,eax) to indicate no FP args in XMM registers, so maybe printf used some 16-byte SSE alignment-required stores to the stack?
Having RSP not even qword-aligned is very bad, but I wouldn't expect it to crash anything that wouldn't also crash with RSP 8-byte aligned (but not 16-byte aligned).