I just found that indirection costs around 3 times as much as a float multiplication!
Is that to be expected? Or is my test wrong?
After I read How much does pointer indirection affect efficiency?, I became worried about the cost of indirection.
Going through a pointer indirection can be much slower because of how a modern CPU works.
Before I prematurely optimize my real code, I want to make sure that it really costs as much as I fear.
I used a trick to find the rough number (3x), as below:
I found that Test2 takes more time than Test1.
Nothing surprising there.
Then I tried to change my code in calculate something expensive
to be more expensive little by little, until both tests took nearly the same time.
Finally, I found that one possible function that makes both tests take the same amount of time (i.e. break even) is:
float*float*...
(3 float multiplications)
Here is my test case (ideone demo):
#include <cstdlib> // rand, RAND_MAX

class C {
public:
    float hello;
    float hello2s[10];
    C() {
        hello = ((double)rand() / RAND_MAX) * 10;
        for (int n = 0; n < 10; n++) {
            hello2s[n] = ((double)rand() / RAND_MAX) * 10;
        }
    }
    float calculateCheap() {
        return hello;
    }
    float calculateExpensive() {
        float result = 1;
        result = hello2s[0] * hello2s[1] * hello2s[2] * hello2s[3] * hello2s[4];
        return result;
    }
};
Here is the main:
#include <chrono>
#include <iostream>

int main() {
    const int numTest = 10000;
    C d[numTest];
    C* e[numTest];
    for (int n = 0; n < numTest; n++) {
        d[n] = C();
        e[n] = new C();
    }
    float accu = 0;
    auto t1 = std::chrono::system_clock::now();
    for (int n = 0; n < numTest; n++) {
        accu += d[n].calculateExpensive(); // direct call
    }
    auto t2 = std::chrono::system_clock::now();
    for (int n = 0; n < numTest; n++) {
        accu += e[n]->calculateCheap(); // indirect call
    }
    auto t3 = std::chrono::system_clock::now();
    std::cout << "direct call time   = " << (t2 - t1).count() << std::endl;
    std::cout << "indirect call time = " << (t3 - t2).count() << std::endl;
    std::cout << "print to disable compiler cheat = " << accu << std::endl;
}
The direct call time and the indirect call time were tuned to be similar, as mentioned above (by editing calculateExpensive).
Hence: indirection cost = 3x float multiplication.
On my desktop (Visual Studio 2015 with -O2), it is 7x.
Is it to be expected that indirection costs around 3x a float multiplication?
If not, how is my test wrong?
(Thanks to enhzflep for suggesting an improvement; the question has been edited.)
The cost of indirection is dominated by cache misses. Honestly, cache misses are so much more expensive than anything else you are talking about that everything else ends up being rounding error.
Cache misses and indirection can be far more expensive than your test indicates.
This is mostly because you have a mere 10,000 objects (about 110,000 floats, well under half a megabyte), so the CPU cache can hold every one of those floats, and the sequential heap allocations will tend to clump together.
You'll get a pile of cache misses, but not one for every element.
Both of your cases are indirect: the "indirect" case has to follow two pointers, and the "direct" case has to do one instance of pointer arithmetic. The "expensive" case may be amenable to SIMD, especially if you have relaxed floating-point accuracy (allowing multiplication reordering).
As seen here, or in this image (not inlined, I lack the rights), the number of main-memory references is going to dominate almost anything else in the above code. A 2 GHz CPU has a 0.5 ns cycle time, and a main-memory reference costs about 100 ns, or 200 cycles of latency.
Meanwhile, a desktop CPU can hit 8+ floating-point operations per cycle if you can pull off vectorized code. That is a potential 1600x difference between a floating-point operation and a single cache miss.
Indirection can cost you the ability to use vectorized instructions (an 8x slowdown), and even when everything is in cache it can still force L2 cache references (a 14x slowdown) more often than the alternative. But these slowdowns are small compared to the 100 ns (200-cycle) main-memory reference delay.
Note that not all CPUs have the same level of vectorization, that some effort is being put into speeding up CPU/main memory delay, that FPUs have different characteristics, and a myriad of other complications.