Are inline functions passed as argument, really executed inline in C/C++?

I have a very long (in number of iterations) for loop, and I like to make it possible to personalize some of its parts. The code looks as following:

function expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(i=0; i<VeryLargeN; i++){
       element=elements[i]
       // long computation that produce a boolean condition
       if (condition){ 
         do_true(element); 
       }else{
         do_false(element);
       }
    }
}

Now, the problem is that every time do_true and do_false are called, there is an overhead due to the push/pop of the stack that ruins the high performance of the code.

To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. This will make impossible the code to mantain.

So, how does someone make the internal part of an iteration so it can be personalized, and still mantain high performance?

Solution

The problem is that the function address (what actually is set in do_true and do_false is not resolved until link time, where there are not many opportunities for optimization.

If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that time.

struct function_one {
  void operator()( int element ) {
  }
};

extern int elements[];
extern bool condition();

template < typename DoTrue, typename DoFalse >
void expensive_loop(){
  DoTrue do_true;
  DoFalse do_false;

  for(int i=0; i<50; i++){
    int element=elements[i];
    // long computation that produce a boolean condition
    if (condition()){ 
      do_true(element); // call DoTrue's operator()
    }else{
      do_false(element); // call DoFalse's operator()
    }
  }
}

int main( int argc, char* argv[] ) {
    expensive_loop<function_one,function_one>();

return 0;
}

The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.

For the example I shown, note how the function is empty. The compiler just strips away the function call and leaves the loop:

main:
    push    rbx
    mov     ebx, 50
.L2:
    call    condition()
    sub     ebx, 1
    jne     .L2
    xor     eax, eax
    pop     rbx
    ret

See example in https://godbolt.org/g/hV52Nn

Using function pointers as in your example, may not inline the function calls. This is the produced assembler for main and expensive_loop in a program where expensive_loop

// File A.cpp
void foo( int arg );
void bar( int arg );

extern bool condition();
extern int elements[];

void expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(int i=0; i<50; i++){
       int element=elements[i];
       // long computation that produce a boolean condition
       if (condition()){
         do_true(element);
       }else{
         do_false(element);
       }
    }
}

int main( int argc, char* argv[] ) {
    expensive_loop( foo, bar );

    return 0;
}

and the functions passed by argument

// File B.cpp
#include <math.h>

int elements[50];

bool condition() {
    return elements[0] == 1;
}

inline int foo( int arg ) {
    return arg%3;
}

inline int bar( int arg ) {
    return 1234%arg;
}

are defined in different translation units.

0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
  400620:       41 55                   push   %r13
  400622:       49 89 fd                mov    %rdi,%r13
  400625:       41 54                   push   %r12
  400627:       49 89 f4                mov    %rsi,%r12
  40062a:       55                      push   %rbp
  40062b:       53                      push   %rbx
  40062c:       bb 60 10 60 00          mov    $0x601060,%ebx
  400631:       48 83 ec 08             sub    $0x8,%rsp
  400635:       eb 19                   jmp    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
  400637:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40063e:       00 00
  400640:       48 83 c3 04             add    $0x4,%rbx
  400644:       41 ff d5                callq  *%r13
  400647:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
  40064e:       74 1d                   je     40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
  400650:       8b 2b                   mov    (%rbx),%ebp
  400652:       e8 79 ff ff ff          callq  4005d0 <condition()>
  400657:       84 c0                   test   %al,%al
  400659:       89 ef                   mov    %ebp,%edi
  40065b:       75 e3                   jne    400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
  40065d:       48 83 c3 04             add    $0x4,%rbx
  400661:       41 ff d4                callq  *%r12
  400664:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
  40066b:       75 e3                   jne    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
  40066d:       48 83 c4 08             add    $0x8,%rsp
  400671:       5b                      pop    %rbx
  400672:       5d                      pop    %rbp
  400673:       41 5c                   pop    %r12
  400675:       41 5d                   pop    %r13
  400677:       c3                      retq
  400678:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40067f:       00

You can see how the calls are still performed even when using -O3 optimization level:

400644:       41 ff d5                callq  *%r13