c++performance assembly x86-64 function-pointers

Function pointer performance; slower on a single call than multiple calls?

I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda through a function pointer.

#include <chrono>
#include <iostream>

using namespace std;

__attribute__((noinline)) int plus_one(int x) {
  return x + 1;
}

typedef int (*FUNC)(int);

#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);

  auto constexpr g_count = 1000000;
  
__attribute__((noinline)) int speed_test_no_param() {

  int r;

  auto local_lambda = [](int a) {
    return plus_one(a);
  };

  FUNC f = local_lambda;
  START_TIMING();
  for (auto i = 0; i < g_count; ++i)
    r = f(100);
  END_TIMING("speed_test_no_param");
  
  return r;
}

__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
  int r;
  START_TIMING();
  for (auto i = 0; i < g_count; ++i)
    r = f(100);
  END_TIMING("speed_test_with_param");
  
  return r;
}

int main() {

  int ret = 0;
  auto main_lambda = [](int a) {
    return plus_one(a);
  };

  ret += speed_test_no_param();
  FUNC fp = main_lambda;
  ret += speed_test_with_param(fp);

  return ret;
}

Built on Ubuntu 20.04 with:

g++ -ggdb -ffunction-sections -O3 -std=c++17   -DNDEBUG=1 -DRELEASE=1  -c speed_test.cpp -o speed_test.o && g++ -o speed_test -Wl,-gc-sections    -Wl,--start-group speed_test.o   -Wl,--rpath='$ORIGIN'   -Wl,--end-group

The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run, this yields the same result:

Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849

When I dig into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param() is:

0x000055555555534b  call 0x555555555310 <plus_one(int)>

... whereas the code for speed_test_with_param is more complicated; a fetch of the address of the lambda, then a jump to the plus_one function:

0x000055555555544e  call QWORD PTR [rbx] 
...
0x0000555555555324  jmp 0x555555555310 <plus_one(int)>

(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)

What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1), the results are flipped with the parameter version being the fastest:

Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31

I have also run this many times; the parameter version is always the fastest.

I do not understand why this is; I don't now believe a call through a parameter is slower than a local variable due to this conflicting evidence, but looking at the assembly suggests it really should be.

Can someone please explain?

UPDATE

As per the comment below, ordering matters. When I call speed_test_with_param() first, speed_test_no_param() is the fastest of the two! Yet when I call speed_test_no_param() first, speed_test_with_param() is the fastest! Any explanation to this would be greatly appreciated!

Solution

With multiple loop iterations in the C++ source, the fast version is only doing one in asm, because you gave the optimizer enough visibility to prove that's equivalent.

Why ordering matters with just one iteration: probably warm-up effects in the library code for std::chrono. Idiomatic way of performance evaluation?

Can you confirm that my suspicion that the call without the parameter technically should be the fastest, because with the parameter involves a memory read to find the location to call?

Much more significant is whether the compiler can constant-propagate the function pointer and see what function is being called; notice how speed_test_with_param has an actual loop that calls g_count times, but speed_test_no_param can see it's calling plus_one. Clang sees through the local lambda and the noinline to notice it has no side-effects, so it only calls it once.

It doesn't inline, but it still does inter-procedural optimization. With GCC, you could block that by using __attribute__((noipa)). GCC's noclone attribute can also stop it from making a copy of the function with constant-propagation into it, but noipa is I think stronger. noinline isn't sufficient for benchmarking stuff that becomes trivial to optimize when the compiler can see everything. But I don't think clang has anything like that.

You can make functions opaque to the optimizer by putting them in separate source files and not using -flto or other option like gcc -fwhole-program

The only reason store/reload is involved with the function pointer is because you passed it by reference for no reason, even though it's just a single pointer. If you pass it by value (https://godbolt.org/z/WEvvsvoxb) you can see call rbx in the loop.

Apparently clang couldn't hoist the load because it wasn't sure the caller's function-pointer wouldn't be modified by the call, because it was making a stand-alone version of speed_test_with_param that would work with any caller and any arg, not just the one main passes. So constprop didn't happen.

An indirect call can mispredict more easily, and yes store/reload adds a few cycles more latency before the prediction can be checked.

So yes, in general you'd expect it to be slower when the function to be called is a function-pointer arg, not a compile-time-constant fptr initialized within the calling function where the compiler can see the definition of what it's calling even if you artificially limit it.

If it becomes a call some_name instead of call rbx, that's still faster even if it does still have to loop like you were trying to make it.

(Microbenchmarking is hard, especially when you're trying to benchmark a C++ concept which can optimize differently depending on context; you have to know enough about compilers, optimization, and assembly to realize what makes the difference and what you're actually measuring. There isn't a meaningful answer to some questions, like "how fast or slow is the + operator?", even if you limit it to integers, because it can optimize away with constants, or vectorize, or not depending on how it's used.)