c gcc assembly interop calling-convention

How to write / build C code to avoid conflicts with existing assembly code?

I need to integrate some C code with an existing project written in assembly. That project uses a number of registers for its internal purposes, so I do not want the C code to overwrite them.

Can I specify which registers GCC can / cannot use? Or should I rather save the registers before calling C code and then restore them?

Also, what other caveats should I be aware of?

Solution

The standard calling convention is normally quite reasonable: it designates some registers as call-clobbered and others as call-preserved. Keep values that must survive function calls in call-preserved registers. For the x86 details, see the function-calling-convention part of "What are the calling conventions for UNIX & Linux system calls on i386 and x86-64".

The standard but less descriptive terms are "caller-saved" vs. "callee-saved". These are confusing because normally nobody saves a call-clobbered register at all: you just let the value die if you don't need it. Another pair is "volatile" vs. "non-volatile", which is somewhat bogus because volatile already has an unrelated, specific technical meaning in C.

I like call-preserved vs. call-clobbered because it describes both kinds of registers from the perspective of the current function using them.
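As a sketch of what "call-preserved" means for hand-written asm called from C, assuming x86-64 Linux (System V ABI, GNU assembler in AT&T syntax): the hypothetical routine `sum3` deliberately borrows RBX, which is call-preserved, so it has to push/pop it, while RAX, RDI, RSI and RDX are call-clobbered and can be used freely. On other targets a plain C stand-in is compiled instead.

```c
#if defined(__x86_64__) && defined(__linux__)
/* Hypothetical asm routine: long sum3(long a, long b, long c).
   Args arrive in RDI, RSI, RDX (call-clobbered); the result goes in RAX.
   It deliberately borrows RBX, which is call-preserved, so it must
   restore the caller's value before returning. */
__asm__(
    ".pushsection .text\n"
    ".globl sum3\n"
    ".type sum3, @function\n"
    "sum3:\n"
    "    push %rbx\n"               /* save caller's RBX (call-preserved) */
    "    mov  %rdi, %rbx\n"
    "    add  %rsi, %rbx\n"         /* RBX = a + b */
    "    lea  (%rbx,%rdx), %rax\n"  /* RAX = a + b + c (return value) */
    "    pop  %rbx\n"               /* restore caller's RBX */
    "    ret\n"
    ".size sum3, .-sum3\n"
    ".popsection\n");

long sum3(long a, long b, long c);  /* C-side prototype for the asm code */
#else
/* Portable stand-in so the file still builds on other targets. */
static long sum3(long a, long b, long c) { return a + b + c; }
#endif
```

From C, `sum3(1, 2, 3)` is an ordinary call; gcc is entitled to assume RBX is unchanged afterwards, which is exactly why the push/pop is required.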

You can use whatever custom calling convention you want for calls between hand-written asm functions, documenting the convention in comments on a per-function basis. It's normally a good idea to use the standard calling convention for your platform as much as possible, only customizing when there's a speedup to be had. Most are fairly well-designed and strike a good balance between performance and code-size, passing args efficiently and so on.

One exception to the rule is that the i386 32-bit calling convention (used on Linux) sucks: it passes all args on the stack, not in registers. You can customize the calling convention x86 gcc will use, for example with -mregparm=2 -msseregparm to pass the first 2 integer args in eax and edx (and float/double args in SSE registers) on 32-bit x86. 32-bit Windows often uses calling conventions like this, e.g. __fastcall or __vectorcall. If you're on x86, see Agner Fog's calling convention guide (and other x86 asm optimization guides).
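A per-function alternative to the global -mregparm option is GCC's `regparm` function attribute on 32-bit x86. The sketch below (the function name `scale_add` and the `REGPARM2` macro are hypothetical) applies it only when compiling for i386, since x86-64 already passes args in registers. The important part is that C callers and any asm that calls or implements the function see the same prototype.

```c
#if defined(__i386__) && defined(__GNUC__)
/* Per-function equivalent of -mregparm=2: the first two integer args
   arrive in EAX and EDX instead of on the stack. */
#define REGPARM2 __attribute__((regparm(2)))
#else
/* x86-64 and most other ABIs already pass args in registers. */
#define REGPARM2
#endif

/* Every caller, C or asm, must use this same declaration. */
REGPARM2 int scale_add(int x, int k);

REGPARM2 int scale_add(int x, int k) { return x * k + 1; }
```

An asm implementation of `scale_add` for i386 would then read `x` from EAX and `k` from EDX rather than from the stack.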

GCC does have some code-gen options that change which registers the calling convention treats as call-clobbered, call-preserved, or off-limits.

You can tell gcc that it must not touch a register at all with -ffixed-reg, where reg is the register name, e.g. -ffixed-rbx. That way the register still holds your value at every point, even in an interrupt or signal handler.
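If C code needs to read or update the value your asm keeps in a reserved register, GCC's global register variables are one option. A minimal sketch, assuming GCC on x86-64 and an arbitrarily chosen R12; the names `asm_state`, `set_state` and `get_state` are hypothetical. Within the file that declares the variable, GCC reserves the register; other translation units would typically be built with the matching -ffixed-r12 so they don't reuse it. Clang's support for this extension is limited, so a plain static variable stands in elsewhere.

```c
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
/* GCC global register variable: in this translation unit R12 is reserved
   for asm_state and never used for anything else. R12 is an arbitrary,
   hypothetical choice; other files should be built with -ffixed-r12. */
register unsigned long asm_state __asm__("r12");
#else
/* Fallback where this GCC extension isn't available. */
static unsigned long asm_state;
#endif

/* Hypothetical accessors that C code can use to share the asm's value. */
void set_state(unsigned long v) { asm_state = v; }
unsigned long get_state(void) { return asm_state; }
```

Because the register is reserved rather than call-preserved, functions in this file never save or restore it: any change is visible to whatever code runs next, which is the point when the whole program cooperates.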

You can also tell gcc that a register is call-preserved (-fcall-saved-reg), so it can use it as long as it saves/restores it. This is probably what you want if you just want gcc to put things back when it's done, without hampering its ability to free up registers in cases where an extra register is worth a save/restore. (If that C code calls back into your asm, it will expect your asm functions to follow the same calling convention you've told it about.)

Interestingly, -fcall-saved-reg seems to work even for arg-passing registers, so you could make multiple function calls without reloading the argument registers.

Finally, -fcall-used-reg tells the compiler that reg is call-clobbered: it's free to use that register for temporaries without saving or restoring it.

Note that it's an error to use -fcall-saved on a return-value register, or -fcall-used on the stack or frame pointer, but gcc may silently do silly things instead of warning!

From the GCC manual: "It is an error to use this flag with the frame pointer or stack pointer. Use of this flag for other registers that have fixed pervasive roles in the machine’s execution model produces disastrous results."

So these advanced options may not protect you from yourself if you use them in a silly way; wear your safety goggles + hard hat. You have been warned.
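If you'd rather not change gcc's flags at all, the other approach from the question also works: the asm side saves whatever call-clobbered registers it still needs around each call into C. A sketch, assuming x86-64 Linux (System V ABI, AT&T syntax); `asm_uses_c` and `c_hook` are hypothetical names. The routine keeps a private value in R10, which C code is allowed to clobber, so it pushes R10 before the call and pops it afterwards. On entry RSP is 8 mod 16, so that single push also leaves RSP 16-byte aligned at the call, as the ABI requires.

```c
/* Ordinary C function using the standard ABI; it may clobber any
   call-clobbered register, including R10. */
long c_hook(long x);
long c_hook(long x) { return 2 * x; }

#if defined(__x86_64__) && defined(__linux__)
/* Hypothetical asm routine: long asm_uses_c(long x).
   It keeps a private value in R10 (call-clobbered in the SysV ABI)
   across a call into C, so the asm side saves R10 itself. */
__asm__(
    ".pushsection .text\n"
    ".globl asm_uses_c\n"
    ".type asm_uses_c, @function\n"
    "asm_uses_c:\n"
    "    mov  $1000, %r10\n"    /* asm-internal value kept in R10 */
    "    push %r10\n"           /* save it; also realigns RSP to 16 */
    "    call c_hook@PLT\n"     /* x is still in RDI; result in RAX */
    "    pop  %r10\n"           /* R10 holds 1000 again */
    "    add  %r10, %rax\n"     /* return c_hook(x) + 1000 */
    "    ret\n"
    ".size asm_uses_c, .-asm_uses_c\n"
    ".popsection\n");

long asm_uses_c(long x);
#else
/* Portable stand-in with the same result on other targets. */
long asm_uses_c(long x) { return c_hook(x) + 1000; }
#endif
```

The trade-off: this costs a push/pop per call even when the C code wouldn't have touched R10, whereas -fcall-saved-reg lets gcc save only the registers it actually uses.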

Example: I used x86-64, but the behavior should be equivalent on any other architecture.

// tempt the compiler into using lots of registers
// to keep values across loop iterations.
int foo(int a, int *p, int len) {
    int t1 = a * 2, t2 = a-1, t3 = a>>3;
    int max = p[0];   // read unconditionally: p must point to at least one int even if len <= 0

    for (int i=0 ; i<len ; i++) {
        p[i] *= t1;
        p[i] |= t2;
        p[i] ^= t3;
        max = (p[i] < max) ? max : p[i];
    }

    return max;
}

On Godbolt, with x86-64 gcc 6.3 and -O3 -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-rsi -fno-tree-vectorize:

foo:        # args in the x86-64 SysV convention: int edi, int *rsi, int edx
    lea     r9d, [rdi+rdi]
    lea     r10d, [rdi-1]
    mov     eax, DWORD PTR [rsi]
    sar     edi, 3
    test    edx, edx          # check if loop runs at least once: len <= 0
    jle     .L10
    push    rsi               # save RSI, normally call-clobbered
    lea     r8d, [rdx-1]
    push    rdx               # and RDX
    lea     r11, [rsi+4+r8*4]
.L3:
    mov     r8d, DWORD PTR [rsi]
    imul    r8d, r9d          # and use of temporaries that require a REX prefix
    or      r8d, r10d
    xor     r8d, edi
    cmp     eax, r8d
    mov     DWORD PTR [rsi], r8d
    cmovl   eax, r8d
    add     rsi, 4            # pointer-increment of RSI as the loop counter
    cmp     r11, rsi
    jne     .L3
    pop     rdx               # and restore RDX + RSI
    pop     rsi
.L10:
    ret

Note the use of r8-r11 as temporaries. These registers require a REX prefix to access, adding 1 byte of code size unless the instruction already needed one for 64-bit operand size. So gcc prefers using the low 8 registers (eax..edi) for scratch regs, only using r8d if it would otherwise have to save/restore rbx or rbp.

The code-gen is basically the same without the -fcall-saved-reg options, but with a different choice of registers and no push/pop.