is it possible/efficient to put fpu exception or inf into work?

I got such code

loop 10 M:

 if( fz != 0.0)     
 { 
  fhx += hx/fz; 
 }

this is called 10 M times in loop needs to be very fast - I onlly need to catch the case when fz is not zero, not to make div by zero error, but it is a very rare case, indeed on 10M cases it should be zero, i dont know once , twice or newer

can i in some way get rid of this 10M of ifs and use "nan/inf" or maybe catching the exception and continue? (if fz is zero i need fhx += 0.0, i mean nothing just continue ? is it possible/efficient to put fpu exception or inf into work?

(Im using c++/mingw32)

Solution

You can, but it's probably not that useful. Masking won't be useful either under the circumstances.

Exceptions are extremely slow when they happen, first a lot of microcoded complex stuff has to happen before the CPU even enters the kernel level exception handler, and then it has to hand it off to your process in a complicated and slow way too. On the other hand, they don't cost anything when they don't happen.

But a comparison and a branch don't really cost anything either, as long as the branch is predictable, which a branch that is essentially never taken is. Of course it costs a little throughput to make them happen at all, but they're not in the critical path .. but even if they were, the real problem here is a division in every iteration.

The throughput of that division is 1 per 14 cycles anyway (on Haswell - worse on other µarchs), unless fz is particularly "nice", and even then it's 1 per 8 cycles (again on Haswell). On Core2 it was more like 19 and 5, on P4 it was more like (in typical P4 fashion) one division per 71 cycles no matter what.

A well-predicted branch and a comparison just disappear into that. On my 4770K, the difference between having a comparison and branch there or not disappeared into the noise (maybe if I run it enough times I will eventually obtain a statistically significant difference, but it will be tiny), with both of them winning randomly about half the time. The code I used for this benchmark was

global bench
proc_frame bench
    push r11
[endprolog]
    xor ecx, ecx
    mov rax, rcx
    mov ecx, -10000000
    vxorps xmm1, xmm1
    vxorps xmm2, xmm2
    vmovapd xmm3, [rel doubleone]
_bench_loop:
    imul eax, ecx, -0xAAAAAAAB  ; distribute zeroes somewhat randomly
    shr eax, 1                  ; increase to make more zeroes
    vxorps xmm0, xmm0
    vcvtsi2sd xmm0, eax
    vcomisd xmm0, xmm1          ; #
    jz _skip                    ; #
    vdivsd xmm0, xmm3, xmm0
    vaddsd xmm2, xmm0
_skip:
    add ecx, 1
    jnz _bench_loop
    vmovapd xmm0, xmm2
    pop r11
    ret
endproc_frame

The other function was the same but with the two lines marked with a # commented out.

The version that eventually consistently wins when the number of zeroes is increased is the one with the branch, indicating that division by zero is significantly slower than a branch misprediction. That's without even using the exception mechanism to create a programmer-visible exception, it's just from the cost of the micro-coded "weird case fix-up" thing running. But you don't have that many zeroes, so,

TL;DR there isn't really a difference.