I just found that indirection costs around 3 times as much as a float multiplication!
Is that to be expected? Or is my test wrong?
After I read How much does pointer indirection affect efficiency?, I became worried about the cost of indirection.
Going through a pointer indirection can be much slower because of how a modern CPU works.
Before I prematurely optimize my real code, I want to make sure that it really costs as much as I fear.
I used a trick to find the rough number (3x), as below:
I found that Test2 takes more time than Test1.
Nothing surprising there.
Then I tried to change my code in calculate something expensive
to be more expensive little by little, until both tests took nearly the same time.
Finally, I found that one possible function that makes both tests take the same amount of time (i.e. break even) is:
float*float*...
(3 float multiplications)
Here is my test case (ideone demo):
#include <cstdlib> // rand, RAND_MAX

class C {
public:
    float hello;
    float hello2s[10];
    C() {
        hello = ((double)rand() / RAND_MAX) * 10;
        for (int n = 0; n < 10; n++) {
            hello2s[n] = ((double)rand() / RAND_MAX) * 10;
        }
    }
    float calculateCheap() {
        return hello;
    }
    float calculateExpensive() {
        float result = 1;
        result = hello2s[0] * hello2s[1] * hello2s[2] * hello2s[3] * hello2s[4];
        return result;
    }
};
Here is the main:
#include <chrono>
#include <iostream>

int main() {
    const int numTest = 10000;
    C d[numTest];
    C* e[numTest];
    for (int n = 0; n < numTest; n++) {
        d[n] = C();
        e[n] = new C();
    }
    float accu = 0;
    auto t1 = std::chrono::system_clock::now();
    for (int n = 0; n < numTest; n++) {
        accu += d[n].calculateExpensive(); // direct call
    }
    auto t2 = std::chrono::system_clock::now();
    for (int n = 0; n < numTest; n++) {
        accu += e[n]->calculateCheap(); // indirect call
    }
    auto t3 = std::chrono::system_clock::now();
    std::cout << "direct call time   = " << (t2 - t1).count() << std::endl;
    std::cout << "indirect call time = " << (t3 - t2).count() << std::endl;
    std::cout << "print to disable compiler cheat = " << accu << std::endl;
}
The direct call time and the indirect call time were tuned to be similar, as mentioned above (by editing calculateExpensive).
Hence: indirection cost = 3x float multiplication.
On my desktop (Visual Studio 2015 with -O2), it is 7x.
Is it to be expected that indirection costs around 3x a float multiplication?
If not, how is my test wrong?
(Thanks to enhzflep for suggesting an improvement; the question has been edited.)
The cost of indirection is dominated by cache misses. Honestly, cache misses are so much more expensive than anything else you are talking about that everything else ends up being rounding error.
Cache misses and indirection can be far more expensive than your test indicates.
This is mostly because you have a mere 10,000 objects (about 110,000 floats, well under half a megabyte), so the CPU cache can hold every one of those floats, and the sequential heap allocations will tend to clump together.
You'll get a pile of cache misses, but not one for every element.
Both of your cases are indirect: the "indirect" case has to follow two pointers, and the "direct" case has to do one instance of pointer arithmetic. The "expensive" case may be amenable to SIMD, especially if you have relaxed floating-point accuracy (allowing multiplication reordering).
As seen here, or in this image (not inlined, I lack the rights), the number of main-memory references is going to dominate almost anything else in the above code. A 2 GHz CPU has a 0.5 ns cycle time, and a main-memory reference costs about 100 ns, or 200 cycles of latency.
Meanwhile, a desktop CPU can hit 8+ floating-point operations per cycle if you can pull off vectorized code. That is a potential 1600x difference between a floating-point operation and a single cache miss.
Indirection can cost you the ability to use vectorized instructions (an 8x slowdown), and even when everything is in cache it can still force L2 cache references (a 14x slowdown) more often than the alternative. But these slowdowns are small compared to the 100 ns (200-cycle) main-memory reference delay.
Note that not all CPUs have the same level of vectorization, that some effort is being put into speeding up CPU/main memory delay, that FPUs have different characteristics, and a myriad of other complications.