c++ x86-64 inline-assembly calling-convention clang-cl

Ensuring x64 compliance of custom ASM-function in clang-cl

For my custom-compiled native x64 JIT code, I have certain intrinsic functions. Most of them are only called from my own code, so I generate them with my own compiler. Some, however, are called directly from C++ code, so I want them compiled into a static lib, so that they can at least be linked statically, if not inlined.

I need to use inline-assembly for those functions, as they perform actions that cannot be expressed in regular C++, like setting a non-volatile register from a function-input. However, the function itself must behave like a regular x64-function - it needs a prolog/epilogue, and it must have the necessary unwind-information to support stack-traces and exception-handling. Thus I cannot use MSVC (which is my native compiler), so I decided to make a static lib with clang-cl in Visual Studio instead. The best I got so far is the following:

void interruptEntry(void* pState, const char* pAddress)
{
    __asm
    {
        // load state into RBX
        mov rbx,rcx
        // load callstack-top into RDI
        mov rax,[rbx]
        mov rdi,[rax]

        // call address
        call rdx
    };
}

This will generate the proper prolog, epilogue and all required unwind-information. However, it critically lacks the 32 bytes of shadow space that the x64 calling convention requires (and that the function at pAddress expects its caller to provide):

Acclimate Engine.dll!interruptEntry(void *, const char *):
 push        rdi  
 push        rbx  
 mov         rbx,rcx  
 mov         rax,qword ptr [rbx]  
 mov         rdi,qword ptr [rax]  
 call        rdx  
 pop         rbx  
 pop         rdi  
 ret

Keep in mind, while this code is generated via clang-cl, the DLL is linked with MSVC. The static-lib is compiled with O2 (set from the VisualStudio-project page).

Things I've tried:

- Modifying RSP manually, with sub RSP,32. This results in a frame-pointer register being established, as the compiler counts this as a dynamic stack allocation. That adds too much overhead to make it worth using a statically compiled function in the first place.
- Similarly, I could reference "pState" directly in asm (mov rbx,pState). This causes the shadow space to be added, but pState is then copied onto the stack and loaded into rbx from that stack location instead of from the register. This once again defeats the purpose of what I am doing here.
- Calling "pAddress" as a function pointer directly, after the asm block. This makes no difference to the generated code.
- Using normal asm(), or extended asm, in combination with "__attribute__((naked))". That will not generate the prolog/epilogue, which I can write myself, but then the unwind-information is missing. clang-cl does not seem to understand any of the unwind-data directives, like .allocstack or .pushreg, resulting in "error : unknown directive", regardless of which type of asm block they are used in.

Is there any reason why the shadow space is missing, and is there any way to get it there without adding unnecessary overhead like a frame pointer (while still having unwind-information)? I'm also open to other suggestions. For example, if there is some intrinsic that lets me set those registers (while still compiling down to a single move), I would not need to use assembly (manipulating specific registers with global effect is the main reason I cannot write plain C++).

Solution

Making calls from inline asm is generally not well supported. Avoid whenever possible.
The compiler only scans the inline asm block to see which registers are potentially clobbered; it doesn't assume that call instructions in asm target functions that follow the standard calling convention for this platform (otherwise why would you be using inline asm in the first place?). So it's a huge pain to do safely. The same is true for x86-64 System V (see "Calling printf in extended inline ASM"): with GNU C inline asm you also have to declare all the register clobbers yourself, and take care of the red zone, since there's no way to declare a clobber on that.

Your idea of using inline asm to leave values in regs and block tail-call optimization is a good idea. But the implementation in your self-answer with two separate asm() statements doesn't do anything to stop the compiler from stepping on RBX with the instructions it emits for code outside the asm statements. A different compiler or version could easily break your code by picking RBX as a temporary instead of RAX when compiling that code between the asm statements. (And since you didn't use __attribute__((noinline)), code from parent functions could be scheduled here.)

You can write it in a way that discourages the compiler from stepping on your registers. Make those values needed in those registers after the call (as inputs to an empty asm statement), so the asm you want is the only efficient choice. That makes it a lot less likely that this will break in practice.

class ExecutionStateJIT;
using Func = void ();

void interruptEntry_safer(ExecutionStateJIT& state, Func pAddress)
{
    register void *state_addr asm("rbx") = &state;
    // strict-aliasing violation, see alternate version that's safe without -fno-strict-aliasing
    register auto* pTemp asm("rdi") = * *((void***)&state);  // two derefs
    // register ... asm("rbx") is actually redundant since I also used specific-register constraints in the asm statement

    // request those vars in RDI and RBX respectively
    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // make sure they're actually loaded before
    pAddress();  // this just uses a function-pointer arg that was already in a register, doesn't need to touch any others

    // prevent a tailcall which would restore RDI and RBX before calling
    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // and still wanted in these call-preserved registers after
}

Using register T foo asm("regname") local register variables lets you ask for values in any of R8-R15 which don't have specific-register constraint letters, forcing an "r"(var) constraint to pick a specific register. (And for many other ISAs, there aren't letters for any single registers.) It's not actually needed here because the "D" and "b" constraints require RDI and RBX respectively.
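As a side illustration of the register-variable mechanism just described (separate from the interruptEntry code), here is a minimal sketch that pins a value in R12, a register with no constraint letter of its own. The function name pin_in_r12 is hypothetical, and the sketch assumes GCC or Clang targeting x86-64; on other targets it falls back to a plain return.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the original code: asks the compiler to
// hold `value` in R12 at the point of the asm statement.
std::uint64_t pin_in_r12(std::uint64_t value)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // The asm("r12") label only takes effect when the variable is used as an
    // operand of an extended asm statement.
    register std::uint64_t pinned asm("r12") = value;
    // Empty template: no instruction is emitted, but the "+r" operand has to
    // live in the register named on the declaration above.
    asm volatile("" : "+r"(pinned));
    return pinned;
#else
    return value;  // fallback for targets without GNU-style register variables
#endif
}
```

The value passes through unchanged, for example pin_in_r12(42) returns 42. The only effect is on which register the compiler uses around the asm statement.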

Godbolt shows it works in GCC and clang (-mabi=ms for GCC, and -target x86_64-w64-windows-gnu for Clang). As a bonus, this doesn't require a very recent Clang for -masm=intel to apply to asm statements, since the actual templates are empty. The action is in the constraints, requiring the compiler to have both values in the registers we want, but without any

Your code also violates the strict-aliasing rule by pointing a void ** at an object of a different type. In GNU C, only [unsigned] char* and pointers to types declared with __attribute__((may_alias)) can be pointed at arbitrary objects. But for compatibility with MSVC, clang-cl probably enables -fno-strict-aliasing.
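To show the aliasing point in isolation, here is a small sketch comparing a load through a may_alias pointer type with the portable std::memcpy equivalent. The Holder struct and both function names are hypothetical stand-ins, and the may_alias version assumes GCC or Clang, since the attribute is a GNU extension.

```cpp
#include <cstring>

// Hypothetical stand-in for an object whose first member is a pointer.
struct Holder { int* first; };

using voidp = void*;
// A void* type that is allowed to alias objects of any other type.
using aliasing_voidp = __attribute__((may_alias)) voidp;

// Reads the first pointer-sized field of `h` as a void* via may_alias.
void* load_first_may_alias(const Holder& h)
{
    return *(const aliasing_voidp*)&h;
}

// Same load written portably: memcpy may read the bytes of any object.
void* load_first_memcpy(const Holder& h)
{
    void* out;
    std::memcpy(&out, &h, sizeof out);
    return out;
}
```

Both functions return the address stored in h.first. Compilers typically reduce the memcpy version to a single load as well, so the choice is mostly about readability and portability.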

class ExecutionStateJIT;
using Func = void ();
void interruptEntry_safer_strict_aliasing(ExecutionStateJIT& state, Func pAddress)
{
    using voidp = void*;
    using aliasing_voidp = __attribute__((may_alias)) voidp;
    // aliasing_voidp is a pointer-to-void (e.g. 8 bytes on x86-64).
    // aliasing_voidp*  can be pointed at any object safely, to let us load a void*
    void *state_addr = &state;
    auto* pTemp = *(aliasing_voidp*)state_addr;  // like memcpy but alignment guaranteed because no __attribute__((aligned(1)))
    pTemp = *(aliasing_voidp*)pTemp;

    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // make sure they're actually loaded before
    pAddress();  // this just uses a function-pointer arg that was already in a register, doesn't need to touch any others
    // prevent a tailcall which would restore RDI and RBX before calling
    asm volatile("" :: "D"(pTemp), "b"(state_addr));  // and still wanted in these call-preserved registers after
}

Both of these compile to the same asm as yours with current versions of GCC and clang, and the source is shorter and easier to read (if you know GNU C inline asm). The point is that they will more reliably do so with future versions and even if inlined into other surrounding code.

interruptEntry_safer_strict_aliasing(ExecutionStateJIT&, void (*)()):
        push    rdi
        push    rbx
        mov     rbx, rcx               # state_addr
        sub     rsp, 40
        mov     rax, QWORD PTR [rcx]   # first pTemp
        mov     rdi, QWORD PTR [rax]   # second value of pTemp
        call    rdx
# clang puts a NOP here for some reason
        add     rsp, 40
        pop     rbx
        pop     rdi
        ret
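The sub rsp, 40 in that listing looks like 32 bytes of shadow space plus 8 bytes of padding. The following sketch checks the arithmetic, assuming the usual Windows x64 rule that RSP is 16-byte aligned at every call instruction; all constant names are made up for illustration.

```cpp
#include <cstddef>

// Byte counts taken from the listing above; names are illustrative only.
constexpr std::size_t kReturnAddress = 8;      // pushed by the caller's call
constexpr std::size_t kPushedRegs    = 2 * 8;  // push rdi, push rbx
constexpr std::size_t kShadowSpace   = 32;     // home space for the callee
constexpr std::size_t kPadding       = 8;      // restores 16-byte alignment
constexpr std::size_t kSubRsp        = kShadowSpace + kPadding;

// Distance from the caller's 16-byte aligned RSP down to RSP at `call rdx`.
constexpr std::size_t bytes_below_aligned_rsp()
{
    return kReturnAddress + kPushedRegs + kSubRsp;
}

static_assert(kSubRsp == 40, "matches sub rsp, 40 in the listing");
static_assert(bytes_below_aligned_rsp() % 16 == 0,
              "RSP is 16-byte aligned again at the inner call");
```

With only 32 bytes subtracted, the total would be 56, which is not a multiple of 16, so the extra 8 bytes are what keep the stack aligned for the callee.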

I didn't use __attribute__((noinline)) on my versions. Even in a use case where they do inline into a caller (e.g. with -flto link-time optimization), the asm statements hopefully convince the compiler not to do something else with RBX or RDI in the window between the first asm statement and the call, if it is moving code around to try to schedule it more efficiently.