Search code examples
cieee-754

Skipping "test for zero" checks in IEEE 754


I read this in a popular book about computer graphics,

There are many numeric computations that become much simpler if the programmer takes advantage of the IEEE rules. For example, consider the expression:

a = 1 / (1/b + 1/c)

Such expressions arise with resistors and lenses. If divide-by-zero resulted in a program crash (as was true in many systems before IEEE floating-point), then two if statements would be required to check for small or zero values of b or c. Instead, with IEEE floating-point, if b or c is zero, we will get a zero value for a as desired.

But what about the case where b=+0 and c=-0? Then a=1/inf-inf=nan. Is the book wrong about this optimizations, or have I misunderstood something? It seems like we'll still need at least one check for the signs of b & c rather than no checks at all.

Edit One suggestion in the comments was just to do a post-check for NaN. Is that the idiomatic way to "take advantage" of these number type?

bool is_nan(float x) { return x != x; }

float optic_thing(float b, float c) {
  float result = 1.0f / (1.0f/b + 1.0f/c);
  if (is_nan(result)) result = 0;
  return result;
}

Solution

  • But what about the case where b=+0 and c=-0?

    Convert -0.0 to +0.0 without branching by adding 0.0.

    int main(void) {
      double b = +0.0;
      double c = -0.0;
      printf("%e\n", b);
      printf("%e\n", c);
      printf("%e\n", 1.0/(1.0/b + 1.0/c));
      b += 0.0;
      c += 0.0;
      printf("%e\n", 1.0/(1.0/b + 1.0/c));
      return 0;
    }
    

    Output

    0.000000e+00
    -0.000000e+00
    nan
    0.000000e+00
    

    [Edit] On the other hand, any values near 0.0, but not 0.0, are likely numeric artifacts and not accurate resistance values. The above still has trouble with tiny value pairs like b = DBL_TRUE_MIN; c = -b; The issue is 1.0/tiny --> Infinity and +Infinity + -Infinity --> NAN. Could resort to either using wider floating point for the quotient calculation or narrow the operands.

      b += 0.0;
      c += 0.0;
      printf("%Le\n", 1.0/(1.0L/b + 1.0L/c));
    
      // Lose some precision, but we are talking about real resistors/lenses here.
      b = (float) b + 0.0f;
      c = (float) c + 0.0f;
      printf("%e\n", 1.0/(1.0/b + 1.0/c));