c++ optimization x86 sse fpu

Is it possible/efficient to put FPU exceptions or inf to work?


I have code like this:

loop 10 M:

 if( fz != 0.0)     
 { 
  fhx += hx/fz; 
 } 

This is called 10M times in a loop and needs to be very fast. I only need to guard against fz being zero, to avoid a division-by-zero error, but that is a very rare case: over 10M iterations fz might be zero once, twice, or never.

Can I somehow get rid of these 10M ifs and use NaN/inf instead, or maybe catch the exception and continue? (If fz is zero I need fhx += 0.0, i.e. do nothing and just continue.) Is it possible/efficient to put FPU exceptions or inf to work?

(I'm using C++/MinGW32.)


Solution

  • You can, but it's probably not that useful. Masking won't be useful either under the circumstances.

    Exceptions are extremely slow when they happen. First a lot of complex microcoded work has to happen before the CPU even enters the kernel-level exception handler, and then the kernel has to hand the exception off to your process in a complicated and slow way too. On the other hand, they don't cost anything when they don't happen.

    But a comparison and a branch don't really cost anything either, as long as the branch is predictable, which a branch that essentially always goes the same way is. Of course it costs a little throughput to execute them at all, but they're not on the critical path. And even if they were, the real problem here is a division in every iteration.
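
    As a rough illustration of that point, here is a minimal C++ sketch of the question's loop in both forms. The function names and the use of `std::vector` are assumptions made for the example (the question only shows the loop body), and the unguarded variant assumes IEEE 754 arithmetic with floating-point exceptions masked, which is the usual default.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical helper: the loop from the question with the check kept.
    // A zero divisor is skipped, which is the "fhx += 0.0" behaviour asked for.
    double accumulate_guarded(const std::vector<double>& hx,
                              const std::vector<double>& fz) {
        assert(hx.size() == fz.size());
        double fhx = 0.0;
        for (std::size_t i = 0; i < hx.size(); ++i) {
            if (fz[i] != 0.0) {  // almost always true, so easy to predict
                fhx += hx[i] / fz[i];
            }
        }
        return fhx;
    }

    // Hypothetical helper: the same loop without the check.
    // Assumes IEEE 754 with exceptions masked: x / 0.0 then yields inf
    // (or NaN when x is also zero) instead of trapping.
    double accumulate_unguarded(const std::vector<double>& hx,
                                const std::vector<double>& fz) {
        assert(hx.size() == fz.size());
        double fhx = 0.0;
        for (std::size_t i = 0; i < hx.size(); ++i) {
            fhx += hx[i] / fz[i];  // one zero divisor poisons the whole sum
        }
        return fhx;
    }
    ```

    With masked exceptions the unguarded version does not crash, but a single zero in `fz` leaves `fhx` at inf or NaN for the rest of the loop, so some check is needed either way.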

    The throughput of that division is 1 per 14 cycles anyway (on Haswell - worse on other µarchs), unless fz is particularly "nice", and even then it's 1 per 8 cycles (again on Haswell). On Core2 it was more like 19 and 5, on P4 it was more like (in typical P4 fashion) one division per 71 cycles no matter what.

    A well-predicted branch and a comparison just disappear into that. On my 4770K, the difference between having a comparison and branch there or not disappeared into the noise (maybe if I run it enough times I will eventually obtain a statistically significant difference, but it will be tiny), with both of them winning randomly about half the time. The code I used for this benchmark was

    global bench
    proc_frame bench
        push r11
    [endprolog]
        xor ecx, ecx
        mov rax, rcx
        mov ecx, -10000000
        vxorps xmm1, xmm1
        vxorps xmm2, xmm2
        vmovapd xmm3, [rel doubleone]
    _bench_loop:
        imul eax, ecx, -0xAAAAAAAB  ; distribute zeroes somewhat randomly
        shr eax, 1                  ; increase to make more zeroes
        vxorps xmm0, xmm0
        vcvtsi2sd xmm0, eax
        vcomisd xmm0, xmm1          ; #
        jz _skip                    ; #
        vdivsd xmm0, xmm3, xmm0
        vaddsd xmm2, xmm0
    _skip:
        add ecx, 1
        jnz _bench_loop
        vmovapd xmm0, xmm2
        pop r11
        ret
    endproc_frame
    

    The other function was the same but with the two lines marked with a # commented out.
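
    For readers who prefer C++ to assembly, the following is a hedged sketch of how the benchmark generates its divisors, based on my reading of the listing above. All names are hypothetical, `shift` stands for the `shr` count, and the sketch is not meant to reproduce the timings since a compiler may generate quite different code.

    ```cpp
    #include <cstdint>

    // Hypothetical port of: imul eax, ecx, -0xAAAAAAAB ; shr eax, shift ; cvtsi2sd.
    // -0xAAAAAAAB truncated to 32 bits is 0x55555555. Assumes 1 <= shift <= 31.
    double divisor_for(std::int32_t counter, unsigned shift) {
        std::uint32_t v = static_cast<std::uint32_t>(counter) * 0x55555555u;
        v >>= shift;  // logical shift, like shr; clears the sign bit
        return static_cast<double>(static_cast<std::int32_t>(v));
    }

    // Counts how many divisors are zero when the counter runs from
    // -iterations up to -1, as ecx does in the assembly loop.
    int count_zero_divisors(std::int32_t iterations, unsigned shift) {
        int zeroes = 0;
        for (std::int32_t c = -iterations; c != 0; ++c) {
            if (divisor_for(c, shift) == 0.0) {
                ++zeroes;
            }
        }
        return zeroes;
    }

    // The guarded kernel: sums 1.0 / divisor and skips zero divisors,
    // mirroring the vcomisd / jz pair marked with # in the listing.
    double bench_guarded(std::int32_t iterations, unsigned shift) {
        double sum = 0.0;
        for (std::int32_t c = -iterations; c != 0; ++c) {
            const double d = divisor_for(c, shift);
            if (d != 0.0) {
                sum += 1.0 / d;
            }
        }
        return sum;
    }
    ```

    Calling `count_zero_divisors` with a larger `shift` is a quick way to see how the comment "increase to make more zeroes" plays out, because more of the low product bits are discarded.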

    The version that eventually consistently wins when the number of zeroes is increased is the one with the branch, indicating that division by zero is significantly slower than a branch misprediction. That's without even using the exception mechanism to create a programmer-visible exception, it's just from the cost of the micro-coded "weird case fix-up" thing running. But you don't have that many zeroes, so,

    TL;DR there isn't really a difference.