Tags: c#, floating-point, precision

How low can a delta value be when comparing two 32-bit floating-point values?


I have a suite of 240 unit tests of FIR filter convolutions with different numbers of taps.

Intended usage:

It's for writing an audio filter in Unity; the initial, naive convolution was done in managed code but was too slow for real-time audio processing.

The unit tests verify the exactness of different vectorized FIR convolution algorithms, so that one can be picked and implemented using Unity Burst, which translates it to SIMD-aware native code.

To be able to decide which convolution algorithm performs best, I wrote a few (a sketch of the naive reference follows the list):

  • naive convolution
    • used as a reference for comparison
  • vectorized convolution
    • vectorized over the inner loop, the outer loop, or both loops
  • half-band versions of the above
    • 1 tap in 2 iterating all the taps
    • 1 tap in 2 iterating half the taps (leveraging filter symmetry)
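
For reference, here is a minimal sketch of the naive baseline (names and the edge-handling convention are illustrative, not my exact test code):

    static float[] ConvolveNaive(float[] signal, float[] taps)
    {
        var output = new float[signal.Length];
        for (int n = 0; n < signal.Length; n++)
        {
            float acc = 0f;
            // One rounded multiply and one rounded add per tap;
            // taps that would read before the start of the signal are skipped.
            for (int k = 0; k < taps.Length && k <= n; k++)
                acc += taps[k] * signal[n - k];
            output[n] = acc;
        }
        return output;
    }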

Actual situation:

All tests are passing, and while polishing I noticed that the delta used for the comparison was 1E-04f.

I lowered it to 1E-05f and that still worked; I then tried 1E-06f, but the pass rate dropped below 50%.

Below are the results for the input signal 0, 1, 2, 3 ... 28, 29, 30, 31.

Example for 1E-05f:

   index         expected           actual       difference    match
   0   -2.7051452E-34   -2.7051452E-34    0.000000E+000     True
   1    0.00011297482    0.00011297482    0.000000E+000     True
   2    0.00022594964    0.00022594964    0.000000E+000     True
   3    -0.0010179491    -0.0010179491    0.000000E+000     True
   4    -0.0022618477    -0.0022618477    0.000000E+000     True
   5     0.0019123415     0.0019123415    0.000000E+000     True
   6       0.00608653       0.00608653    0.000000E+000     True
   7    -0.0052014478    -0.0052014478    0.000000E+000     True
   8     -0.016489426     -0.016489424    1.862645E-009     True
   9      0.009599558      0.009599558    0.000000E+000     True
  10      0.035688534      0.035688538    3.725290E-009     True
  11     -0.026160091     -0.026160091    0.000000E+000     True
  12      -0.08800873      -0.08800873    0.000000E+000     True
  13       0.16196862       0.16196862    0.000000E+000     True
  14       0.91199124        0.9119913    5.960464E-008     True
  15          1.97384          1.97384    0.000000E+000     True
  16        3.0356889        3.0356886    2.384186E-007     True
  17        4.0095997        4.0095997    0.000000E+000     True
  18        4.9835114         4.983511    4.768372E-007     True
  19        5.9947987         5.994799    4.768372E-007     True
  20         7.006087        7.0060863    4.768372E-007     True
  21         8.001913         8.001913    0.000000E+000     True
  22         8.997738         8.997738    0.000000E+000     True
  23         9.998984         9.998981    2.861023E-006     True
  24        11.000226        11.000226    0.000000E+000     True
  25       12.0001135       12.0001135    0.000000E+000     True
  26               13               13    0.000000E+000     True
  27        13.999999               14    9.536743E-007     True
  28        15.000001        15.000001    0.000000E+000     True
  29        15.999998               16    1.907349E-006     True
  30               17        17.000002    1.907349E-006     True
  31        18.000002               18    1.907349E-006     True

Example for 1E-06f:

   index         expected           actual       difference    match
   0   -2.7051452E-34   -2.7051452E-34    0.000000E+000     True
   1    0.00011297482    0.00011297482    0.000000E+000     True
   2    0.00022594964    0.00022594964    0.000000E+000     True
   3    -0.0010179491    -0.0010179491    0.000000E+000     True
   4    -0.0022618477    -0.0022618477    0.000000E+000     True
   5     0.0019123415     0.0019123415    0.000000E+000     True
   6       0.00608653       0.00608653    0.000000E+000     True
   7    -0.0052014478    -0.0052014478    0.000000E+000     True
   8     -0.016489426     -0.016489424    1.862645E-009     True
   9      0.009599558      0.009599558    0.000000E+000     True
  10      0.035688534      0.035688538    3.725290E-009     True
  11     -0.026160091     -0.026160091    0.000000E+000     True
  12      -0.08800873      -0.08800873    0.000000E+000     True
  13       0.16196862       0.16196862    0.000000E+000     True
  14       0.91199124        0.9119913    5.960464E-008     True
  15          1.97384          1.97384    0.000000E+000     True
  16        3.0356889        3.0356886    2.384186E-007     True
  17        4.0095997        4.0095997    0.000000E+000     True
  18        4.9835114         4.983511    4.768372E-007     True
  19        5.9947987         5.994799    4.768372E-007     True
  20         7.006087        7.0060863    4.768372E-007     True
  21         8.001913         8.001913    0.000000E+000     True
  22         8.997738         8.997738    0.000000E+000     True
  23         9.998984         9.998981    2.861023E-006    False
  24        11.000226        11.000226    0.000000E+000     True
  25       12.0001135       12.0001135    0.000000E+000     True
  26               13               13    0.000000E+000     True
  27        13.999999               14    9.536743E-007     True
  28        15.000001        15.000001    0.000000E+000     True
  29        15.999998               16    1.907349E-006    False
  30               17        17.000002    1.907349E-006    False
  31        18.000002               18    1.907349E-006    False

This is how I perform the comparison:

const float delta = 1E-05f;

// Per-sample comparison, run for each index i (Math.Abs needs `using System;`):
var expectedValue = expected[i];
var actualValue   = actual[i];
var difference    = Math.Abs(expectedValue - actualValue); // absolute difference
var failed        = difference > delta;

According to the documentation, float precision is about 6 to 9 significant decimal digits.

Question:

Is 1E-05f indeed the lowest delta I can compare against, or is 1E-06f possible?


Solution

    Bounding Floating-Point Rounding Error

    I have a suite of 240 unit tests of FIR filter convolutions with different numbers of taps.

    The rounding error that occurs in floating-point operations depends on the values of the operands and on the operations performed. You have not given those values or the operations. (An FIR is mathematically a linear function of multiple variables, but floating-point arithmetic is not associative, so grouping the operations in various ways may reduce the error.)

    Without exact values, it may be possible to compute bounds on the errors given bounds on the values, but you have not given bounds on the values.

    When using IEEE-754 arithmetic with round-to-nearest, the maximum error in a single operation is ½ULP(y), where y is the ideal mathematical result and ULP(y) is the unit of least precision for y. The unit of least precision is the place value of the lowest bit in the significand, as scaled for numbers in the range of y.

    For the IEEE-754 binary32 format, commonly used for float, ULP(y) is at most y·2⁻²³ for y within the finite range of the format. (Outside the finite range, the potential error is unbounded.)
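
    To see this concretely, here is a sketch that measures one ULP directly (it assumes .NET Core 3.0+ for MathF.BitIncrement; the Ulp name is introduced here for illustration):

        // One ULP of a finite float: distance to the next representable value.
        static float Ulp(float y) => MathF.BitIncrement(MathF.Abs(y)) - MathF.Abs(y);

        // Ulp(1f)  == 1.1920929E-07 (= 2^-23)
        // Ulp(16f) == 1.9073486E-06 (= 2^-19), which matches the 1.907349E-006
        // differences near 16 in the tables above.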

    Consider an FIR implementation that accumulates the total iteratively, one term at a time, on data and taps each with magnitude at most m. The first product is at most m², so a bound on the error is ½ULP(m²). Then another term is computed, with bound at most ½ULP(m²), and added to the first product, producing a result with magnitude at most 2m² (except it might be a bit more if the rounding errors increased the terms), so that addition has an error bound of ½ULP(2m²) = ULP(m²). The total bounds we know so far are ½ULP(m²) + ½ULP(m²) + ULP(m²) = 2 ULP(m²). Given t taps, there are t individual multiplications with error bound ½ULP(m²) and t−1 additions with error bounds ½ULP(i·m²) for 1 < i ≤ t.

    Since ULP(y) is bounded by y·2⁻²³, a bound on the total error is t·½·m²·2⁻²³ + sum(½·i·m²·2⁻²³ for 1 < i ≤ t) = t·m²·2⁻²⁴ + m²·2⁻²⁴·sum(i for 1 < i ≤ t) = t·m²·2⁻²⁴ + m²·2⁻²⁴·(t(t+1)/2 − 1) = 2⁻²⁴·m²·(½·t² + 3/2·t − 1).

    Except for neglecting the possibility of operations rounding up so as to increase the ULP (statistically unlikely), this is an absolute bound on the error. It may be larger than you observe in practice. With additional information, it might be possible to derive a lower bound.

    Note that it increases with the square of the number of taps.
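
    A quick sketch of this bound in code (ErrorBound and the example values are purely illustrative):

        // 2^-24 * m^2 * (t^2/2 + 3t/2 - 1), computed in double so the
        // bound itself is not rounded away.
        static float ErrorBound(float m, int t) =>
            (float)(Math.Pow(2, -24) * m * m * (0.5 * t * t + 1.5 * t - 1.0));

        // e.g. ErrorBound(32f, 63) ≈ 0.127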

    Eliminating Floating-Point Rounding Error

    Your goal in testing an FIR is likely to test the code that was written to implement the FIR, not to test the hardware implementation of floating-point arithmetic. If you can test whether the software is performing the desired additions and multiplications on the desired data, that is good test coverage, and you do not need to test that the floating-point arithmetic is being performed correctly.

    For this purpose, it suffices to construct test cases that exercise the fundamental FIR arithmetic.

    For example, construct input data that contains all zeros except for a single one somewhere in the interior. The output data should contain all zeros before the filter reaches the impulse, then it should contain each of the filter coefficients, one by one in the appropriate order, and then zeros should resume. These outputs will have no rounding errors. A simple test like this tests whether the FIR implementation is using the expected coefficients in the expected places, for data in the interior of the signal. Additional data could test the edges.
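
    A sketch of such an impulse test (Convolve, the indexing convention, and the assertion API are assumptions, not details from the question):

        float[] taps   = { 0.25f, 0.5f, 0.25f };
        float[] signal = new float[16];
        signal[8] = 1f;                        // single impulse in the interior

        float[] output = Convolve(signal, taps);

        // The impulse reproduces the coefficients with no rounding error,
        // so exact equality is a valid assertion here.
        for (int i = 0; i < taps.Length; i++)
            Assert.AreEqual(taps[i], output[8 + i]);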

    Of course, the FIR implementation might be accidentally using a “maximum” operation between terms instead of an addition. And a single impulse tests only one point in the FIR, which is inadequate for testing a vectorized implementation. We could construct data in which the filter coefficients are, say, 1, 8, 64, 512, …, 2²¹, and the signal data contains some pattern such as 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7. The expected outputs should all be exact (no rounding errors) because all of the results, including intermediate calculations, are exactly representable in the binary32 format. But this data is a good test of whether the expected calculations for an FIR are being performed. You could also randomize the order of the signal data (and the results would still be exact).

    In summary, testing using a filter with coefficients 2³ⁱ for 0 ≤ i < 8 and a signal with random data from { 1, 2, 3, 4, 5, 6, 7 } will have no rounding errors and will be a good test for bugs in an FIR implementation.
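
    For instance (ConvolveNaive and ConvolveVectorized stand in for the implementations under test, and the assertion API is assumed):

        var rng = new Random(42);                   // fixed seed for reproducibility

        var taps = new float[8];
        for (int i = 0; i < taps.Length; i++)
            taps[i] = 1 << (3 * i);                 // 1, 8, 64, ..., 2^21

        var signal = new float[64];
        for (int i = 0; i < signal.Length; i++)
            signal[i] = rng.Next(1, 8);             // random values from { 1, ..., 7 }

        // Every term and partial sum stays below 2^24, so correct implementations
        // must agree bit-for-bit; no delta is needed.
        CollectionAssert.AreEqual(ConvolveNaive(signal, taps),
                                  ConvolveVectorized(signal, taps));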

    More broadly, for a filter with t taps, we could set the signal data and the filter coefficients to any integer values of magnitude at most sqrt(2²⁴/t), and the expected results would always be representable in the binary32 format. You could select randomly or in patterns that are useful to you.
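
    A sketch of that magnitude bound (MaxExactMagnitude is a name introduced here for illustration):

        // Largest integer magnitude whose t-tap dot product stays at or below 2^24,
        // keeping every term and partial sum exactly representable in binary32.
        static int MaxExactMagnitude(int t) =>
            (int)Math.Floor(Math.Sqrt((1 << 24) / (double)t));

        // e.g. MaxExactMagnitude(8) == 1448, MaxExactMagnitude(64) == 512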