
Inlining of vararg functions


While playing about with optimisation settings, I noticed an interesting phenomenon: functions taking a variable number of arguments (...) never seemed to get inlined. (Obviously this behavior is compiler-specific, but I've tested on a couple of different systems.)

For example, compiling the following small program:

#include <stdarg.h>
#include <stdio.h>

static inline void test(const char *format, ...)
{
  va_list ap;
  va_start(ap, format);
  vprintf(format, ap);
  va_end(ap);
}

int main()
{
  test("Hello %s\n", "world");
  return 0;
}

will seemingly always result in a (possibly mangled) test symbol appearing in the resulting executable (tested with Clang and GCC in both C and C++ modes on macOS and Linux). If one modifies the signature of test() to take a plain string which is passed to printf(), the function is inlined from -O1 upwards by both compilers, as you'd expect.
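
For reference, one possible reading of that non-variadic variant is sketched below. The name test_plain and the int return value (printf's character count) are assumptions made for illustration; the original test() returns void.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical non-variadic counterpart of test(): the variable part of
   the message is passed as an ordinary string parameter, so no va_list
   is involved. Returns whatever printf returns (illustration only). */
static inline int test_plain(const char *name)
{
    assert(name != NULL);
    return printf("Hello %s\n", name);
}
```

Calling test_plain("world") from main() in place of test("Hello %s\n", "world") gives the compiler an ordinary function with a fixed parameter list to consider for inlining.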

I suspect this is to do with the voodoo magic used to implement varargs, but how exactly this is usually done is a mystery to me. Can anybody enlighten me as to how compilers typically implement vararg functions, and why this seemingly prevents inlining?


Solution

  • At least on x86-64, passing variadic arguments is quite complex, because arguments are passed in registers. Other architectures may be less complex, but it is rarely trivial. In particular, the generated code may need a stack frame or frame pointer to refer to when fetching each argument. These sorts of rules may well stop the compiler from inlining the function.

    On x86-64, the function prologue stores the integer argument registers and the eight SSE argument registers (%xmm0 to %xmm7) to the stack. In the listings below, the SSE stores are skipped when %al is zero.

    This is the function from the original code compiled with Clang:

    test:                                   # @test
        subq    $200, %rsp
        testb   %al, %al
        je  .LBB1_2
    # BB#1:                                 # %entry
        movaps  %xmm0, 48(%rsp)
        movaps  %xmm1, 64(%rsp)
        movaps  %xmm2, 80(%rsp)
        movaps  %xmm3, 96(%rsp)
        movaps  %xmm4, 112(%rsp)
        movaps  %xmm5, 128(%rsp)
        movaps  %xmm6, 144(%rsp)
        movaps  %xmm7, 160(%rsp)
    .LBB1_2:                                # %entry
        movq    %r9, 40(%rsp)
        movq    %r8, 32(%rsp)
        movq    %rcx, 24(%rsp)
        movq    %rdx, 16(%rsp)
        movq    %rsi, 8(%rsp)
        leaq    (%rsp), %rax
        movq    %rax, 192(%rsp)
        leaq    208(%rsp), %rax
        movq    %rax, 184(%rsp)
        movl    $48, 180(%rsp)
        movl    $8, 176(%rsp)
        movq    stdout(%rip), %rdi
        leaq    176(%rsp), %rdx
        movl    $.L.str, %esi
        callq   vfprintf
        addq    $200, %rsp
        retq
    

    And from GCC:

    test.constprop.0:
        .cfi_startproc
        subq    $216, %rsp
        .cfi_def_cfa_offset 224
        testb   %al, %al
        movq    %rsi, 40(%rsp)
        movq    %rdx, 48(%rsp)
        movq    %rcx, 56(%rsp)
        movq    %r8, 64(%rsp)
        movq    %r9, 72(%rsp)
        je  .L2
        movaps  %xmm0, 80(%rsp)
        movaps  %xmm1, 96(%rsp)
        movaps  %xmm2, 112(%rsp)
        movaps  %xmm3, 128(%rsp)
        movaps  %xmm4, 144(%rsp)
        movaps  %xmm5, 160(%rsp)
        movaps  %xmm6, 176(%rsp)
        movaps  %xmm7, 192(%rsp)
    .L2:
        leaq    224(%rsp), %rax
        leaq    8(%rsp), %rdx
        movl    $.LC0, %esi
        movq    stdout(%rip), %rdi
        movq    %rax, 16(%rsp)
        leaq    32(%rsp), %rax
        movl    $8, 8(%rsp)
        movl    $48, 12(%rsp)
        movq    %rax, 24(%rsp)
        call    vfprintf
        addq    $216, %rsp
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
    

    With Clang targeting 32-bit x86, it is much simpler:

    test:                                   # @test
        subl    $28, %esp
        leal    36(%esp), %eax
        movl    %eax, 24(%esp)
        movl    stdout, %ecx
        movl    %eax, 8(%esp)
        movl    %ecx, (%esp)
        movl    $.L.str, 4(%esp)
        calll   vfprintf
        addl    $28, %esp
        retl
    

    There's nothing really stopping any of the above code from being inlined as such, so it would appear that it is simply a policy decision by the compiler writers. Of course, for a call to something like printf, it's pretty meaningless to optimise away a call/return pair at the cost of the code expansion. After all, printf is NOT a small function.

    (A decent part of my work for most of the past year has been to implement printf in an OpenCL environment, so I know far more than most people will ever even look up about format specifiers and various other tricky parts of printf.)

    Edit: The OpenCL compiler we use WILL inline calls to varargs functions, so it is possible to implement such a thing. It won't do it for calls to printf, because that bloats the code considerably. By default, our compiler inlines EVERYTHING, all the time, no matter what it is, and it does work. But we found that having 2-3 copies of printf in the code makes it REALLY huge (with all sorts of other drawbacks, including final code generation taking a lot longer due to some bad choices of algorithms in the compiler backend), so we had to add code to STOP the compiler doing that.
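
To make the x86-64 listings above a little more concrete: the stores of 8 and 48 just before the call to vfprintf initialise the gp_offset and fp_offset fields of the va_list record, and the two leaq results fill in its overflow_arg_area and reg_save_area pointers. The sketch below is only a model of how va_arg could fetch an integer from such a record, not what any compiler literally emits. The field names come from the System V x86-64 ABI; model_va_list and model_va_arg_int64 are made-up names, and only 8-byte integer arguments are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the System V x86-64 va_list record.
   Field names follow the ABI; the type name is invented here. */
typedef struct {
    unsigned int gp_offset;           /* offset of next unread integer register slot */
    unsigned int fp_offset;           /* offset of next unread SSE register slot */
    unsigned char *overflow_arg_area; /* next argument passed on the caller's stack */
    unsigned char *reg_save_area;     /* where the prologue stored the registers */
} model_va_list;

/* Six 8-byte integer register slots (rdi, rsi, rdx, rcx, r8, r9). */
enum { MODEL_GP_END = 6 * 8 };

/* Model of va_arg(ap, <8-byte integer>): read from the register save
   area while integer slots remain, otherwise from the overflow area. */
static int64_t model_va_arg_int64(model_va_list *ap)
{
    int64_t value;
    assert(ap != NULL);
    if (ap->gp_offset < MODEL_GP_END) {
        memcpy(&value, ap->reg_save_area + ap->gp_offset, sizeof value);
        ap->gp_offset += 8;          /* advance to the next register slot */
    } else {
        memcpy(&value, ap->overflow_arg_area, sizeof value);
        ap->overflow_arg_area += 8;  /* advance along the caller's stack */
    }
    return value;
}
```

With gp_offset starting at 8, the first fetch reads the slot where %rsi was saved (slot 0 would hold %rdi, the named format argument). Once gp_offset reaches 48, the model falls back to the caller's stack, which is essentially all that a purely stack-based convention such as 32-bit x86 has to do.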