I came across this SO question about how the "final" keyword can be used to reduce virtual-call overhead (Virtual function efficiency and the 'final' keyword). Based on this answer, the expectation is that a derived-class pointer calling overridden methods marked final will not incur the overhead of dynamic dispatch.
To benchmark the benefit of this approach, I set up a few sample classes and ran them on Quick-Bench - here is the link. There are 3 cases:
Case 1: Derived class pointer without final specifier:
Derived* f = new DerivedWithoutFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times
Case 2: Base class pointer with final specifier:
Base* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times
Case 3: Derived class pointer with final specifier:
Derived* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times
Here the function run_multiple looks as follows:

int run_multiple(int times) specifiers { // `specifiers` is a placeholder for override/final as appropriate to each case
    int sum = 0;
    for (int i = 0; i < times; i++) {
        sum += run_once();
    }
    return sum;
}
The results I observed were:
By Speed: Case 2 == Case 3 > Case 1
But shouldn't Case 3 be much faster than Case 2? Is there something wrong with my experiment design or my assumptions about the expected result?
Edit:
Peter Cordes pointed out some really useful articles for further reading related to this topic:
Is final used for optimization in C++?
Why can't gcc devirtualize this function call?
LTO, Devirtualization, and Virtual Tables
You're correctly understanding the effects of final (except maybe the inner loop of Case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere because mt19937 is just plain slow, and all 3 versions spend most of their time in it.

The only thing that's not lost / buried in noise / overhead is the effect of inlining int run_once() override final into the inner loop in FooPlus::run_multiple, which both Case 2 and Case 3 run.

But Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there's function-call overhead inside the inner loop, unlike the other 2 cases.
Case 2 has to call run_multiple repeatedly, but that's only once per 100 runs of run_once and has no measurable effect.
For all 3 cases, most of the time is spent in dist(rng), because std::mt19937 is pretty slow compared to the extra overhead of not inlining a function call. Out-of-order execution can probably hide a lot of that overhead, too. But not all of it, so there's still something to measure.
Case 3 is able to inline everything to this asm loop (from your quickbench link):
# percentages are *self* time, not including time spent in the PRNG
# These are from Quick-Bench's perf report tab,
# presumably samples of the core-clock-cycles perf event.
# Take them with a grain of salt: superscalar + out-of-order exec
# makes it hard to blame any one instruction for a clock cycle
VirtualWithFinalCase2(benchmark::State&): # case 3 from QuickBench link
... setup before the loop
.p2align 3
.Louter: # do{
xor %ebp,%ebp # sum = 0
mov $0x64,%ebx # inner = 100
.p2align 3 # nopw 0x0(%rax,%rax,1)
.Linner: # do {
51.82% mov %r13,%rdi
mov %r15,%rsi
mov %r13,%rdx # copy args from call-preserved regs
callq 404d60 # mt PRNG for unsigned long
47.27% add %eax,%ebp # sum += run_once()
add $0xffffffff,%ebx # --inner
jne .Linner # }while(inner);
mov %ebp,0x4(%rsp) # store to volatile local: benchmark::DoNotOptimize(x);
0.91% add $0xffffffffffffffff,%r12 # --outer
        jne .Louter           # } while(outer)
Case 2 can still inline run_once into run_multiple because class FooPlus uses int run_once() override final. There's virtual-dispatch overhead in the outer loop (only), but this small extra cost for each outer-loop iteration is totally dwarfed by the cost of the inner loop (identical between Case 2 and Case 3).
So the inner loop will be essentially identical, with indirect-call overhead only in the outer loop. It's unsurprising that this is unmeasurable or at least lost in noise on Quickbench.
Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there's function-call overhead there, too. (The fact that it's an indirect function call is relatively minor; in a tight loop, branch prediction will do a near-perfect job.)
Case 1 and Case 2 have identical asm for their outer loop, if you look at the disassembly on your Quick-Bench link.
Neither one can devirtualize and inline run_multiple: Case 1 because it's virtual and non-final, Case 2 because the pointer is only the base class, not the derived class with a final override.
# case 2 and case 1 *outer* loops
.loop: # do {
mov (%r15),%rax # load vtable pointer
mov $0x64,%esi # first C++ arg
mov %r15,%rdi # this pointer = hidden first arg
callq *0x8(%rax) # memory-indirect call through a vtable entry
mov %eax,0x4(%rsp) # store the return value to a `volatile` local
add $0xffffffffffffffff,%rbx
    jne .loop                   # } while(--i != 0);
This is probably a missed optimization: the compiler can prove that Base *f came from new FooPlus(), and thus is statically known to be of type FooPlus. operator new can be overridden, but the compiler still emits a separate call to FooPlus::FooPlus() (passing it a pointer to the storage from new). So this just seems to be a case of clang not taking advantage of that in Case 2, and maybe also in Case 1.