Segfault when calling, but not jumping to, address in rax

I'm working with an assembler-like API (it's not really an assembler, but it can emit machine code) that I am debugging and toying with. It's specifically for System V x86_64 ABI, so I'm only going to be talking about SysV calling conventions and such.

For some reason, when I emit some contrived code like this, for testing purposes

builder.emit_sub(rsp, 1);
builder.emit_movq_vr(reinterpret_cast<uint64_t>(&hello_world), rax);
builder.emit_call(rax);
builder.emit_add(rsp, 1);
builder.emit_ret();

a segmentation fault occurs at the call (when it is run, not when being assembled), and yet

builder.emit_movq_vr(reinterpret_cast<uint64_t>(&hello_world), rax);
builder.emit_jmp(rax);

succeeds just fine. The point of failure seems to be at the call instruction, but I don't know what is bugging out the pseudo-assembler. It might be emitting the wrong opcode operands or something, but I'm not sure. The raw emitted machine code looks something like this for the buggy code, alongside the opcode that it is supposed to represent, as printed by some simple debug statements

sub    48 81 EC 01 00 00 00
movqvr 48 B8 63 80 AA 01 01 00 00 00
call   FF D0
add    48 81 C4 01 00 00 00
ret    C3

Remark: movqvr is not a real instruction [mnemonic]; the vr at the end is just a debug annotation to me saying it's a "move imm64 to reg" kind of instruction.

Remark: The sub and add are to align the stack on a 16-byte boundary, which I believe is a necessity in this ABI. They could've been better written as a push rax and a pop rax (or pop rcx if rax is needed for a return value), but ignore that, unless it is this that is messing up the call (e.g. if rsp is not being modified correctly).

Solution

Yes, the System V ABI requires rsp to be 16-byte aligned immediately before every call instruction. The call itself then pushes an 8-byte return address, so on function entry rsp is 8 bytes past a 16-byte boundary, and it takes another 8 bytes (not 1) to reach the next boundary. Remember that in C, pointer arithmetic is scaled by sizeof(type), but in asm offsets are plain byte counts.
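To make the arithmetic concrete, here is a minimal sketch that models rsp as a plain integer. The helper names (is_aligned16, rsp_on_entry, after_sub) and the example stack address are made up for illustration; they are not part of your builder API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if the given rsp value is a multiple of 16.
inline bool is_aligned16(std::uint64_t rsp) { return rsp % 16 == 0; }

// Hypothetical helper: rsp as seen by the callee on entry. The caller had
// rsp 16-byte aligned before `call`, and `call` pushed an 8-byte return address.
inline std::uint64_t rsp_on_entry(std::uint64_t rsp_before_call) {
    assert(is_aligned16(rsp_before_call));  // ABI precondition
    return rsp_before_call - 8;
}

// Hypothetical helper: rsp after `sub rsp, bytes`. The operand is a byte
// count; nothing scales it by a pointer or register size.
inline std::uint64_t after_sub(std::uint64_t rsp, std::uint64_t bytes) {
    return rsp - bytes;
}
```

With an aligned starting value such as 0x7ffffffde000, after_sub(rsp_on_entry(...), 1) ends in ...7 and is misaligned, while after_sub(rsp_on_entry(...), 8) ends in ...0 and is back on a 16-byte boundary.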

And yes, push rax / pop rcx would be a good choice, and is what clang / LLVM does if it doesn't already need to push an odd number of call-preserved registers or reserve any extra stack space. If you do need to reserve any stack space for locals, use an offset that will leave rsp 16-byte aligned.
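As a sketch of what the push / pop version would look like in raw bytes, the function below builds the whole sequence into a byte vector. emit_aligned_call is a hypothetical stand-in, not your builder's API, and it assumes the emitted code is entered via a normal call (so rsp is 8 bytes off a 16-byte boundary on entry).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: emits
//   push rax ; mov rax, imm64 ; call rax ; pop rcx ; ret
std::vector<std::uint8_t> emit_aligned_call(std::uint64_t target) {
    std::vector<std::uint8_t> code;
    code.push_back(0x50);                        // push rax: rsp -= 8, realigns to 16
    code.push_back(0x48);                        // REX.W prefix
    code.push_back(0xB8);                        // mov rax, imm64
    for (int i = 0; i < 8; ++i)                  // imm64 in little-endian byte order
        code.push_back(static_cast<std::uint8_t>(target >> (8 * i)));
    code.push_back(0xFF);                        // call r/m64 ...
    code.push_back(0xD0);                        // ... with ModRM selecting rax
    code.push_back(0x59);                        // pop rcx: rsp += 8, keeps rax (return value)
    code.push_back(0xC3);                        // ret
    assert(code.size() == 15);                   // 1 + 10 + 2 + 1 + 1 bytes
    return code;
}
```

Compared with your dump, the mov and call bytes are unchanged; only the 7-byte sub / add pair is replaced by the 1-byte push / pop.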

BTW, you could save code size by using the sub r/m64, imm8 encoding when the immediate fits in a sign-extended 8-bit value (i.e. if ((int8_t)imm == imm)). Also, if you ever need to adjust rsp by exactly 128, note that +128 does not fit in an imm8 but -128 does, so you can write sub rsp, 128 as add rsp, -128 (e.g. after an odd number of push instructions), and likewise add rsp, 128 as sub rsp, -128.
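A minimal sketch of that encoding choice, assuming the emitter just appends bytes to a vector. The function names here are hypothetical; the opcodes are the standard 83 /r ib and 81 /r id forms with a REX.W prefix.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode `op rsp, imm` where modrm is 0xEC for
// `sub rsp, ...` and 0xC4 for `add rsp, ...`.
static std::vector<std::uint8_t> encode_rsp_imm(std::uint8_t modrm, std::int32_t imm) {
    std::vector<std::uint8_t> code;
    code.push_back(0x48);                              // REX.W: 64-bit operand size
    if (static_cast<std::int8_t>(imm) == imm) {        // fits in a sign-extended imm8
        code.push_back(0x83);
        code.push_back(modrm);
        code.push_back(static_cast<std::uint8_t>(imm));
    } else {                                           // fall back to imm32
        code.push_back(0x81);
        code.push_back(modrm);
        const std::uint32_t bits = static_cast<std::uint32_t>(imm);
        for (int i = 0; i < 4; ++i)                    // little-endian imm32
            code.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
    }
    return code;
}

std::vector<std::uint8_t> encode_sub_rsp(std::int32_t imm) { return encode_rsp_imm(0xEC, imm); }
std::vector<std::uint8_t> encode_add_rsp(std::int32_t imm) { return encode_rsp_imm(0xC4, imm); }
```

So encode_sub_rsp(8) is 4 bytes (48 83 EC 08) instead of 7, encode_sub_rsp(128) still needs the 7-byte imm32 form, and encode_add_rsp(-128) gets the same effect in 4 bytes (48 83 C4 80).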

If you know the address where your code will run from, and the target is within +-2GiB of it, you should use the call rel32 encoding rather than a register-indirect call. But you're right that reaching an arbitrary 64-bit address requires this mov r64, imm64 sequence followed by an indirect call or jmp, not a direct call.
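For reference, a sketch of how the rel32 form could be emitted. encode_call_rel32 is a hypothetical function, and call_addr is assumed to be the final run-time address of the E8 byte itself; the displacement is relative to the end of the 5-byte instruction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: append `call rel32` (E8 cd) to `out`.
// Returns false and leaves `out` untouched if the target is out of rel32 range,
// in which case the mov r64, imm64 + call reg sequence is still needed.
bool encode_call_rel32(std::uint64_t call_addr, std::uint64_t target,
                       std::vector<std::uint8_t>& out) {
    const std::uint64_t next = call_addr + 5;            // address after the call
    const std::int64_t rel = static_cast<std::int64_t>(target - next);
    if (rel < INT32_MIN || rel > INT32_MAX)
        return false;                                    // does not fit in 32 bits
    const std::size_t before = out.size();
    out.push_back(0xE8);                                 // call rel32 opcode
    const std::uint32_t bits = static_cast<std::uint32_t>(rel);
    for (int i = 0; i < 4; ++i)                          // little-endian rel32
        out.push_back(static_cast<std::uint8_t>(bits >> (8 * i)));
    assert(out.size() == before + 5);
    return true;
}
```

This only works if the buffer's final address is known at emit time (or patched afterwards), which is why JITs that allocate code far from the callee often stick with the indirect form.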

Did you use a debugger to find out where hello_world crashed? Maybe if it calls printf (rather than puts), it forgot to zero al (with xor eax,eax) to indicate no FP args in XMM registers, so maybe printf used some SSE stores to the stack that require 16-byte alignment?

Having RSP not even qword-aligned is very bad, but I wouldn't expect it to crash anything that wouldn't also crash with RSP 8-byte aligned (but not 16-byte aligned).