Search code examples
c++chpc

Are inline functions passed as argument, really executed inline in C/C++?


I have a very long (in number of iterations) for loop, and I like to make it possible to personalize some of its parts. The code looks as following:

function expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(i=0; i<VeryLargeN; i++){
       element=elements[i]
       // long computation that produce a boolean condition
       if (condition){ 
         do_true(element); 
       }else{
         do_false(element);
       }
    }
}

Now, the problem is that every time do_true and do_false are called, there is an overhead due to the push/pop of the stack that ruins the high performance of the code.

To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. This will make impossible the code to mantain.

So, how does someone make the internal part of an iteration so it can be personalized, and still mantain high performance?


Solution

  • The problem is that the function address (what actually is set in do_true and do_false is not resolved until link time, where there are not many opportunities for optimization.

    If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that time.

    struct function_one {
      void operator()( int element ) {
      }
    };
    
    extern int elements[];
    extern bool condition();
    
    template < typename DoTrue, typename DoFalse >
    void expensive_loop(){
      DoTrue do_true;
      DoFalse do_false;
    
      for(int i=0; i<50; i++){
        int element=elements[i];
        // long computation that produce a boolean condition
        if (condition()){ 
          do_true(element); // call DoTrue's operator()
        }else{
          do_false(element); // call DoFalse's operator()
        }
      }
    }
    
    int main( int argc, char* argv[] ) {
        expensive_loop<function_one,function_one>();
    
    return 0;
    }
    

    The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.

    For the example I shown, note how the function is empty. The compiler just strips away the function call and leaves the loop:

    main:
        push    rbx
        mov     ebx, 50
    .L2:
        call    condition()
        sub     ebx, 1
        jne     .L2
        xor     eax, eax
        pop     rbx
        ret
    

    See example in https://godbolt.org/g/hV52Nn

    Using function pointers as in your example, may not inline the function calls. This is the produced assembler for main and expensive_loop in a program where expensive_loop

    // File A.cpp
    void foo( int arg );
    void bar( int arg );
    
    extern bool condition();
    extern int elements[];
    
    void expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
        for(int i=0; i<50; i++){
           int element=elements[i];
           // long computation that produce a boolean condition
           if (condition()){
             do_true(element);
           }else{
             do_false(element);
           }
        }
    }
    
    int main( int argc, char* argv[] ) {
        expensive_loop( foo, bar );
    
        return 0;
    }
    

    and the functions passed by argument

    // File B.cpp
    #include <math.h>
    
    int elements[50];
    
    bool condition() {
        return elements[0] == 1;
    }
    
    inline int foo( int arg ) {
        return arg%3;
    }
    
    inline int bar( int arg ) {
        return 1234%arg;
    }
    

    are defined in different translation units.

    0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
      400620:       41 55                   push   %r13
      400622:       49 89 fd                mov    %rdi,%r13
      400625:       41 54                   push   %r12
      400627:       49 89 f4                mov    %rsi,%r12
      40062a:       55                      push   %rbp
      40062b:       53                      push   %rbx
      40062c:       bb 60 10 60 00          mov    $0x601060,%ebx
      400631:       48 83 ec 08             sub    $0x8,%rsp
      400635:       eb 19                   jmp    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
      400637:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
      40063e:       00 00
      400640:       48 83 c3 04             add    $0x4,%rbx
      400644:       41 ff d5                callq  *%r13
      400647:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
      40064e:       74 1d                   je     40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
      400650:       8b 2b                   mov    (%rbx),%ebp
      400652:       e8 79 ff ff ff          callq  4005d0 <condition()>
      400657:       84 c0                   test   %al,%al
      400659:       89 ef                   mov    %ebp,%edi
      40065b:       75 e3                   jne    400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
      40065d:       48 83 c3 04             add    $0x4,%rbx
      400661:       41 ff d4                callq  *%r12
      400664:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
      40066b:       75 e3                   jne    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
      40066d:       48 83 c4 08             add    $0x8,%rsp
      400671:       5b                      pop    %rbx
      400672:       5d                      pop    %rbp
      400673:       41 5c                   pop    %r12
      400675:       41 5d                   pop    %r13
      400677:       c3                      retq
      400678:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
      40067f:       00
    

    You can see how the calls are still performed even when using -O3 optimization level:

    400644:       41 ff d5                callq  *%r13