
Using final to reduce virtual method overhead


I came across this SO question about how the `final` keyword can be used to reduce virtual method overhead (Virtual function efficiency and the 'final' keyword). Based on this answer, the expectation is that a derived-class pointer calling overridden methods marked `final` will not incur the overhead of dynamic dispatch.

To benchmark the benefit of this approach, I set up a few sample classes and ran them on Quick-Bench - Here is the link. There are 3 cases here:
Case 1: Derived class pointer without final specifier:

Derived* f = new DerivedWithoutFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Case 2: Base class pointer with final specifier:

Base* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Case 3: Derived class pointer with final specifier:

Derived* f = new DerivedWithFinalSpecifier();
f->run_multiple(100); // calls an overridden method 100 times

Here the function run_multiple looks as follows:

int run_multiple(int times) /* specifiers vary by case */ {
    int sum = 0;
    for(int i = 0; i < times; i++) {
        sum += run_once();
    }
    return sum;
}

The results I observed were:
By Speed: Case 2 == Case 3 > Case 1

But shouldn't Case 3 be much faster than Case 2? Is there something wrong with my experiment design, or with my assumptions about the expected result?

Edit: Peter Cordes pointed out some really useful articles for further reading related to this topic:
Is final used for optimization in C++?
Why can't gcc devirtualize this function call?
LTO, Devirtualization, and Virtual Tables


Solution

  • You're correctly understanding the effects of final (except maybe the inner loop of case 2), but your cost estimates are way off. We shouldn't expect a big effect anywhere because mt19937 is just plain slow and all 3 versions spend most of their time in it.


    The only thing that's not lost / buried in noise / overhead is the effect of inlining int run_once() override final into the inner loop in FooPlus::run_multiple, which both Case 2 and Case 3 run.

    But Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there's function-call overhead inside the inner loop, unlike the other 2 cases.

    Case 2 has to call run_multiple repeatedly, but that's only once per 100 runs of run_once and has no measurable effect.


    For all 3 cases, most of the time is spent dist(rng);, because std::mt19937 is pretty slow compared to the extra overhead of not inlining a function call. Out-of-order execution can probably hide a lot of that overhead, too. But not all of it, so there's still something to measure.
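The per-call work the answer refers to is one PRNG draw; a sketch of its shape (the `Worker` name and the distribution bounds are assumptions, matching the general pattern of the quick-bench code, not copied from it):

```cpp
#include <random>

// One mt19937 draw per run_once(): this dominates the cost of the call,
// so the dispatch/inlining difference between the cases is mostly buried.
struct Worker {
    std::mt19937 rng{42};                            // fixed seed for reproducibility
    std::uniform_int_distribution<int> dist{0, 100}; // inclusive bounds
    int run_once() { return dist(rng); }             // most time is spent here
};
```

Swapping the mt19937 draw for trivial work (e.g. returning a constant) would make the call overhead of Case 1 a much larger fraction of each iteration and therefore easier to measure.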

    Case 3 is able to inline everything to this asm loop (from your quickbench link):

     # percentages are *self* time, not including time spent in the PRNG
     # These are from QuickBench's perf report tab,
     #  presumably sample for core clock cycle perf events.
     # Take them with a grain of salt: superscalar + out-of-order exec
     #  makes it hard to blame one instruction for a clock cycle
    
       VirtualWithFinalCase2(benchmark::State&):   # case 3 from QuickBench link
         ... setup before the loop
         .p2align 3
        .Louter:                # do{
           xor    %ebp,%ebp          # sum = 0
           mov    $0x64,%ebx         # inner = 100
         .p2align 3  #  nopw   0x0(%rax,%rax,1)
         .Linner:                    # do {
    51.82% mov    %r13,%rdi
           mov    %r15,%rsi
           mov    %r13,%rdx           # copy args from call-preserved regs
           callq  404d60              # mt PRNG for unsigned long
    47.27% add    %eax,%ebp           # sum += run_once()
           add    $0xffffffff,%ebx    # --inner
           jne    .Linner            # }while(inner);
           mov    %ebp,0x4(%rsp)     # store to volatile local:  benchmark::DoNotOptimize(x);
    0.91%  add    $0xffffffffffffffff,%r12   # --outer
       jne    .Louter           # } while(outer)
    

    Case 2 can still inline run_once into run_multiple because class FooPlus uses int run_once() override final. There's virtual-dispatch overhead in the outer loop (only), but this small extra cost for each outer loop iteration is totally dwarfed by the cost of the inner loop (identical between Case 2 and Case 3).

    So the inner loop will be essentially identical, with indirect-call overhead only in the outer loop. It's unsurprising that this is unmeasurable, or at least lost in noise, on Quick-Bench.


    Case 1 can't inline Foo::run_once() into Foo::run_multiple(), so there's function-call overhead there, too. (The fact that it's an indirect function call is relatively minor; in a tight loop branch prediction will do a near-perfect job.)


    Case 1 and Case 2 have identical asm for their outer loop, if you look at the disassembly on your Quick-Bench link.

    Neither one can devirtualize and inline run_multiple: Case 1 because it's virtual and not final, Case 2 because the pointer's static type is only the base class, not the derived class with the final override.

            # case 2 and case 1 *outer* loops
          .loop:                 # do {
           mov    (%r15),%rax     # load vtable pointer
           mov    $0x64,%esi      # first C++ arg
           mov    %r15,%rdi       # this pointer = hidden first arg
           callq  *0x8(%rax)      # memory-indirect call through a vtable entry
           mov    %eax,0x4(%rsp)  # store the return value to a `volatile` local
           add    $0xffffffffffffffff,%rbx      
   jne    .loop           #  } while(--i != 0);
    

    This is probably a missed optimization: the compiler can prove that Base *f came from new FooPlus(), and thus is statically known to be of type FooPlus. operator new can be overridden, but the compiler still emits a separate call to FooPlus::FooPlus() (passing it a pointer to the storage from new). So this just seems to be a case of clang not taking advantage of that in Case 2, and maybe also in Case 1.
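If you don't want to rely on the compiler's escape analysis, you can hand it the static type explicitly. A sketch (using the answer's FooPlus naming, with stand-in method bodies, so the return values here are hypothetical):

```cpp
struct Foo {
    virtual int run_once() { return 1; }
    virtual int run_multiple(int times) {
        int sum = 0;
        for (int i = 0; i < times; i++) sum += run_once();
        return sum;
    }
    virtual ~Foo() = default;
};

struct FooPlus : Foo {
    int run_once() override final { return 2; }
    int run_multiple(int times) override final {
        int sum = 0;
        for (int i = 0; i < times; i++) sum += run_once();  // inlinable: run_once is final
        return sum;
    }
};

int use_static_type() {
    FooPlus obj;            // automatic storage: exact dynamic type is visible
    Foo& f = obj;
    // Casting back to the final class's static type permits a direct
    // (non-vtable) call, independent of any devirtualization analysis:
    return static_cast<FooPlus&>(f).run_multiple(100);
}
```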