Search code examples
c++performancecpucpu-cachebranch-prediction

Cache miss penalty on branching


I wonder is it faster to replace branching with 2 multiplications or no (due to cache miss penalty)?
Here is my case:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

if(dot<0){
    dir.x = -dir.x;
    dir.y = -dir.y;
}

And I'm trying to replace it with:

float dot = rib1.x*-dir.y + rib1.y*dir.x;

int sgn = (dot  < 0.0) - (0.0 < dot ); //returns -1 or 1 (no branching here, tested)
dir.x *= sgn;
dir.y *= sgn;

Solution

  • The cost of the multiplication depends on several factors, whether you use 32-bit or 64-bit floats, and whether you enable SSE or not. The cost of two float multiplications is 10 cycles according to this source: http://www.agner.org/optimize/instruction_tables.pdf

    The cost of the branch also depends on several factors. As a rule of thumb, do not worry about branches in your code. The exact behaviour of the branch predictor on the CPU will define the performance, but in this case you should probably expect that the branch will be unpredictable at best, so this is likely to lead to a lot of branch mispredictions. The cost of a branch misprediction is 10-30 cycles according to this source: http://valgrind.org/docs/manual/cg-manual.html

    The best advice anyone can give here is to profile and test. I would guess that on a modern Core i7 the two multiplications should be faster than the branch, if the range of input varies sufficiently as to cause sufficient branch mispredictions as to outweigh the cost of the additional multiplication.

    Assuming 50% miss rate, the cost of the branch averages 15 cycles (30 * 0.5), the cost of the float mul is 10 cycles.


    EDIT: Added links, updated estimated instruction cost.