I have this code:
for (int i = 0; i < 10000000; ++i)   // runs 10M times; fz, hx change per iteration (not shown)
{
    if (fz != 0.0)
    {
        fhx += hx / fz;
    }
}
This is called 10M times in a loop and needs to be very fast. The if is only there to avoid a division-by-zero error; the case where fz actually is zero is extremely rare: over 10M iterations it might happen once, twice, or never.
Can I somehow get rid of these 10M ifs and rely on NaN/Inf instead, or catch the exception and continue? (If fz is zero I want fhx += 0.0, i.e. nothing, just continue.) Is it possible/efficient to put an FPU exception or Inf to work?
(I'm using C++/mingw32.)
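For example, something like this is what I imagine (an untested sketch, with <cmath> included; q is just my illustration, not code I have):

double q = hx / fz;       // IEEE gives +-inf (or nan for 0/0) when fz == 0.0
if (std::isfinite(q))     // filter the bad result out afterwards
    fhx += q;

Though that still checks something every iteration, just on the result instead of on fz.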
You can, but it's probably not that useful. Masking won't be useful either under the circumstances.
Exceptions are extremely slow when they happen: a lot of micro-coded complex work has to happen before the CPU even enters the kernel-level exception handler, and then the kernel has to hand control off to your process in a complicated and slow way too. On the other hand, they cost nothing when they don't happen.
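For reference, this is roughly the machinery that "putting the FPU exception into work" involves on mingw/MSVCRT. An untested sketch, not a recommendation: _controlfp with _EM_ZERODIVIDE unmasks the trap, and the CRT delivers it as SIGFPE. Resuming the interrupted loop afterwards is exactly the complicated, slow part described above.

#include <float.h>   // _controlfp, _EM_ZERODIVIDE, _MCW_EM (MSVCRT-specific)
#include <csignal>
#include <cstdio>
#include <cstdlib>

static void on_fpe(int)
{
    // getting back into the faulting loop and continuing is the hard,
    // platform-specific part; this handler just reports and quits
    std::puts("caught SIGFPE (divide by zero)");
    std::exit(1);
}

int main()
{
    std::signal(SIGFPE, on_fpe);
    // FP exceptions are masked by default, so 1.0/0.0 silently gives inf;
    // clearing the _EM_ZERODIVIDE mask bit turns it into a trap instead
    _controlfp(_controlfp(0, 0) & ~_EM_ZERODIVIDE, _MCW_EM);

    volatile double fz = 0.0;       // volatile so the division isn't folded away
    std::printf("%f\n", 1.0 / fz);  // traps instead of printing inf
}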
But a comparison and a branch don't really cost anything either, as long as the branch is predictable, and a branch that is essentially never taken is about as predictable as it gets. Of course they cost a little throughput just by being there, but they're not on the critical path. And even if they were, the real problem here is the division in every iteration.
The throughput of that division is 1 per 14 cycles anyway (on Haswell; worse on other µarchs), unless fz is particularly "nice", in which case it's 1 per 8 cycles (again on Haswell). On Core2 those figures were more like 19 and 5; on the P4 it was (in typical P4 fashion) one division per 71 cycles no matter what.
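(To put that in perspective: at 1 division per 14 cycles, the 10M iterations from the question already cost on the order of 140 million cycles, roughly 40 ms at 3.5 GHz, before the comparison and branch contribute anything at all.)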
A well-predicted branch and a comparison simply disappear into that. On my 4770K, the difference between having the comparison and branch there or not disappeared into the noise (maybe if I ran it enough times I would eventually obtain a statistically significant difference, but it would be tiny), with the two versions winning randomly about half the time each. The code I used for this benchmark was:
global bench
proc_frame bench
push r11
[endprolog]
xor ecx, ecx
mov rax, rcx                  ; rax = 0
mov ecx, -10000000            ; 10M iterations, counting up towards zero
vxorps xmm1, xmm1             ; xmm1 = 0.0, the comparand
vxorps xmm2, xmm2             ; xmm2 = 0.0, the running sum
vmovapd xmm3, [rel doubleone]
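; (doubleone is a 64-bit 1.0 constant defined in a data section that is not shown here)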
_bench_loop:
imul eax, ecx, -0xAAAAAAAB ; distribute zeroes somewhat randomly
shr eax, 1 ; increase to make more zeroes
vxorps xmm0, xmm0       ; break the dependency on xmm0's previous value
vcvtsi2sd xmm0, eax     ; fz = (double)eax
vcomisd xmm0, xmm1      ; # compare fz against 0.0
jz _skip                ; # skip the division when fz is zero
vdivsd xmm0, xmm3, xmm0 ; 1.0 / fz
vaddsd xmm2, xmm0       ; accumulate
_skip:
add ecx, 1
jnz _bench_loop
vmovapd xmm0, xmm2
pop r11
ret
endproc_frame
The other function was the same but with the two lines marked with a # commented out.
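If you would rather read it in C++, here is a rough, untested reconstruction of the same benchmark. This is my sketch of it, not the code that was measured, and a compiler is free to transform it, which is exactly why the measurement was done in assembly. The zero-distribution trick is the same one: -0xAAAAAAAB wraps to 0x55555555 as a 32-bit immediate.

#include <cstdint>
#include <cstdio>

double bench(bool guarded)
{
    double sum = 0.0;
    // ecx runs from -10000000 up towards zero in the assembly version
    for (int32_t c = -10000000; c != 0; ++c) {
        // imul eax, ecx, -0xAAAAAAAB ; shr eax, 1
        uint32_t x = (static_cast<uint32_t>(c) * 0x55555555u) >> 1;
        double fz = static_cast<double>(static_cast<int32_t>(x));
        if (!guarded || fz != 0.0)   // the comparison and branch under test
            sum += 1.0 / fz;         // unguarded: produces inf when fz == 0.0
    }
    return sum;
}

int main()
{
    std::printf("guarded:   %f\n", bench(true));
    std::printf("unguarded: %f\n", bench(false));
}

Note that in the unguarded version the single zero in the sequence turns the whole sum into inf, which is also why the "just let it produce inf" idea from the question still needs some per-iteration check on the result.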
The version that eventually consistently wins as the number of zeroes is increased is the one with the branch, which indicates that a division by zero is significantly slower than a branch misprediction. And that's without even using the exception mechanism to raise a programmer-visible exception; it's just the cost of the micro-coded "weird case fix-up" running. But you don't have that many zeroes, so:
TL;DR: there isn't really a difference.