I have a suite of 240 unit tests for FIR filter convolutions with different numbers of taps.
Intended usage:
It's for an audio filter in Unity. The initial, naive convolution was done in managed code but was too slow for real-time audio processing.
The unit tests verify the exactness of several vectorized FIR convolution algorithms; one of them will be picked and implemented with Unity Burst, which translates it to SIMD-aware native code.
To be able to decide which convolution algorithm performs better, I wrote a few of them.
Actual situation:
All tests are passing, and while polishing I noticed that the delta used for the comparisons was 1E-04f. I lowered it to 1E-05f and that still worked; I then tried 1E-06f, but the pass rate drops below 50%.
Below are the results for an input signal of 0, 1, 2, 3, ..., 30, 31.
Example for 1E-05f:
index expected actual difference match
0 -2.7051452E-34 -2.7051452E-34 0.000000E+000 True
1 0.00011297482 0.00011297482 0.000000E+000 True
2 0.00022594964 0.00022594964 0.000000E+000 True
3 -0.0010179491 -0.0010179491 0.000000E+000 True
4 -0.0022618477 -0.0022618477 0.000000E+000 True
5 0.0019123415 0.0019123415 0.000000E+000 True
6 0.00608653 0.00608653 0.000000E+000 True
7 -0.0052014478 -0.0052014478 0.000000E+000 True
8 -0.016489426 -0.016489424 1.862645E-009 True
9 0.009599558 0.009599558 0.000000E+000 True
10 0.035688534 0.035688538 3.725290E-009 True
11 -0.026160091 -0.026160091 0.000000E+000 True
12 -0.08800873 -0.08800873 0.000000E+000 True
13 0.16196862 0.16196862 0.000000E+000 True
14 0.91199124 0.9119913 5.960464E-008 True
15 1.97384 1.97384 0.000000E+000 True
16 3.0356889 3.0356886 2.384186E-007 True
17 4.0095997 4.0095997 0.000000E+000 True
18 4.9835114 4.983511 4.768372E-007 True
19 5.9947987 5.994799 4.768372E-007 True
20 7.006087 7.0060863 4.768372E-007 True
21 8.001913 8.001913 0.000000E+000 True
22 8.997738 8.997738 0.000000E+000 True
23 9.998984 9.998981 2.861023E-006 True
24 11.000226 11.000226 0.000000E+000 True
25 12.0001135 12.0001135 0.000000E+000 True
26 13 13 0.000000E+000 True
27 13.999999 14 9.536743E-007 True
28 15.000001 15.000001 0.000000E+000 True
29 15.999998 16 1.907349E-006 True
30 17 17.000002 1.907349E-006 True
31 18.000002 18 1.907349E-006 True
Example for 1E-06f:
index expected actual difference match
0 -2.7051452E-34 -2.7051452E-34 0.000000E+000 True
1 0.00011297482 0.00011297482 0.000000E+000 True
2 0.00022594964 0.00022594964 0.000000E+000 True
3 -0.0010179491 -0.0010179491 0.000000E+000 True
4 -0.0022618477 -0.0022618477 0.000000E+000 True
5 0.0019123415 0.0019123415 0.000000E+000 True
6 0.00608653 0.00608653 0.000000E+000 True
7 -0.0052014478 -0.0052014478 0.000000E+000 True
8 -0.016489426 -0.016489424 1.862645E-009 True
9 0.009599558 0.009599558 0.000000E+000 True
10 0.035688534 0.035688538 3.725290E-009 True
11 -0.026160091 -0.026160091 0.000000E+000 True
12 -0.08800873 -0.08800873 0.000000E+000 True
13 0.16196862 0.16196862 0.000000E+000 True
14 0.91199124 0.9119913 5.960464E-008 True
15 1.97384 1.97384 0.000000E+000 True
16 3.0356889 3.0356886 2.384186E-007 True
17 4.0095997 4.0095997 0.000000E+000 True
18 4.9835114 4.983511 4.768372E-007 True
19 5.9947987 5.994799 4.768372E-007 True
20 7.006087 7.0060863 4.768372E-007 True
21 8.001913 8.001913 0.000000E+000 True
22 8.997738 8.997738 0.000000E+000 True
23 9.998984 9.998981 2.861023E-006 False
24 11.000226 11.000226 0.000000E+000 True
25 12.0001135 12.0001135 0.000000E+000 True
26 13 13 0.000000E+000 True
27 13.999999 14 9.536743E-007 True
28 15.000001 15.000001 0.000000E+000 True
29 15.999998 16 1.907349E-006 False
30 17 17.000002 1.907349E-006 False
31 18.000002 18 1.907349E-006 False
This is how I perform the comparison:
const float delta = 1E-05f;
var expectedValue = expected[i];
var actualValue = actual[i];
var difference = Math.Abs(expectedValue - actualValue);
var failed = difference > delta;
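Note that the comparison above uses a single absolute delta for every index, while the differences in the tables grow with the magnitude of the expected value (the failures at 1E-06f are all on the largest outputs). One common alternative, not part of the question's code, is a combined absolute/relative tolerance. A Python sketch for brevity (the project itself is C#); nearly_equal, abs_tol, and rel_tol are illustrative names:

```python
def nearly_equal(expected, actual, abs_tol=1e-6, rel_tol=2**-20):
    # Pass if the difference is small in absolute terms (useful near zero)
    # or small relative to the magnitude of the expected value.
    return abs(expected - actual) <= max(abs_tol, rel_tol * abs(expected))

# The large entries in the tables (e.g. index 31) fail a fixed 1E-06 delta
# but are accurate to within a few ULPs of their own magnitude:
print(nearly_equal(18.000002, 18.0))           # True
print(nearly_equal(0.00011297482, 0.000114))   # False: difference exceeds both tolerances
```

With a relative component, a vectorized implementation whose error scales with the output magnitude (as floating-point error does) can be held to a much tighter standard than a single absolute delta allows.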
According to the documentation, float precision is about 6 to 9 significant decimal digits.
Question:
Is 1E-05f really the lowest delta I can compare against, or is 1E-06f possible?
“I have a suite of 240 unit tests for FIR filter convolutions with different numbers of taps.”
The rounding error that occurs in floating-point operations depends on the values of the operands and on the operations performed. You have not given those values or the operations. (An FIR is mathematically a linear function of multiple variables, but floating-point arithmetic is not associative, so grouping the operations in various ways may reduce the error.)
Without exact values, it may be possible to compute bounds on the errors given bounds on the values, but you have not given bounds on the values.
When using IEEE-754 arithmetic with round-to-nearest, the maximum error of a single operation is ½ ULP(y), where y is the ideal mathematical result and ULP(y) is the unit of least precision for y: the place value of the lowest bit in the significand as scaled for numbers in the range of y.
For the IEEE-754 binary32 format, commonly used for float, ULP(y) is at most y·2^−23 for y within the finite range of the format. (Outside the finite range, the potential error is unbounded.)
Consider an FIR implementation that accumulates the total iteratively, one term at a time, on data and taps each with magnitude at most m. The first product is at most m^2, so a bound on its error is ½ ULP(m^2). Then another term is computed, with error bound at most ½ ULP(m^2), and added to the first product, producing a result with magnitude at most 2·m^2 (it might be slightly more if the rounding errors increased the terms), so that addition has an error bound of ½ ULP(2·m^2) = ULP(m^2). The total bound so far is ½ ULP(m^2) + ½ ULP(m^2) + ULP(m^2) = 2 ULP(m^2). Given t taps, there are t individual multiplications with error bound ½ ULP(m^2) and t−1 additions with error bounds ½ ULP(i·m^2) for 1 < i ≤ t.
Since ULP(y) is bounded by y·2^−23, a bound on the total error is t·½·m^2·2^−23 + sum(½·i·m^2·2^−23 for 1 < i ≤ t) = t·m^2·2^−24 + m^2·2^−24·sum(i for 1 < i ≤ t) = t·m^2·2^−24 + m^2·2^−24·(t(t+1)/2 − 1) = 2^−24·m^2·(½·t^2 + 3⁄2·t − 1).
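Plugging numbers into that bound makes it concrete. A Python sketch (the choice of t = 32 taps and m = 32, matching the 0..31 ramp input, is an assumption for illustration):

```python
# Worst-case absolute error bound for an iterative FIR accumulation,
# per the derivation above: 2^-24 * m^2 * (t^2/2 + 3t/2 - 1).
def fir_error_bound(t, m):
    return 2**-24 * m**2 * (0.5 * t**2 + 1.5 * t - 1)

# With t = 32 taps and data/coefficients bounded by m = 32:
print(fir_error_bound(32, 32))  # about 0.034
```

The observed differences in the tables (at most about 3E-06) are far below this worst case, as expected: the bound assumes every rounding error lines up in the same direction, which is statistically unlikely.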
Except for neglecting the possibility of operations rounding up so as to increase the ULP (statistically unlikely), this is an absolute bound on the error. It may be larger than you observe in practice. With additional information, it might be possible to derive a lower bound.
Note that it increases with the square of the number of taps.
Your goal in testing an FIR is likely to test the code that was written to implement the FIR, not to test the hardware implementation of floating-point arithmetic. If you can test whether the software is performing the desired additions and multiplications on the desired data, that is good test coverage, and you do not need to test that the floating-point arithmetic is being performed correctly.
For this purpose, it suffices to construct test cases that exercise the fundamental FIR arithmetic.
For example, construct input data that contains all zeros except for a single one somewhere in the interior. The output data should contain all zeros before the filter reaches the impulse, then it should contain each of the filter coefficients, one by one in the appropriate order, and then zeros should resume. These outputs will have no rounding errors. A simple test like this tests whether the FIR implementation is using the expected coefficients in the expected places, for data in the interior of the signal. Additional data could test the edges.
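The impulse test described above can be sketched as follows, in Python for brevity (naive_fir is a plain reference convolution written for this illustration, not code from the question):

```python
# An all-zero signal with a single 1 in the interior must reproduce the
# filter coefficients exactly: every product is c*1 or c*0, and every
# partial sum adds at most one nonzero term, so there is no rounding.
def naive_fir(signal, taps):
    # y[n] = sum(taps[k] * signal[n - k])
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(taps):
            if 0 <= n - k < len(signal):
                acc += c * signal[n - k]
        out.append(acc)
    return out

taps = [0.25, -0.5, 1.0, -0.5, 0.25]
signal = [0.0] * 16
signal[8] = 1.0  # impulse in the interior

out = naive_fir(signal, taps)
# zeros before the impulse, then the coefficients in order, then zeros again
assert out[8:8 + len(taps)] == taps
assert all(v == 0.0 for v in out[:8]) and all(v == 0.0 for v in out[8 + len(taps):])
```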
Of course, the FIR implementation might be accidentally using a “maximum” operation between terms instead of an addition. And a single impulse tests only one point in the FIR, which is inadequate for testing a vectorized implementation. We could construct data in which the filter coefficients are, say, 1, 8, 64, 512, …, 2^21, and the signal data contains some pattern such as 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7. The expected outputs should all be exact (no rounding errors) because all of the results, including intermediate calculations, are exactly representable in the binary32 format. But this data is a good test of whether the expected calculations for an FIR are being performed. You could also randomize the order of the signal data (and the results would still be exact).
In summary, testing using a filter with coefficients 2^(3i) for 0 ≤ i < 8 and a signal with random data from { 1, 2, 3, 4, 5, 6, 7 } will have no rounding errors and will be a good test for bugs in an FIR implementation.
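That exact-by-construction test can be sketched in Python (binary32 rounding is simulated here with struct; f32 and fir_f32 are illustrative helpers, not code from the question):

```python
import random
import struct

def f32(x):
    # Round a Python float to binary32, as a float-based FIR would.
    return struct.unpack('f', struct.pack('f', x))[0]

def fir_f32(signal, taps):
    # Reference FIR with every intermediate rounded to binary32.
    out = []
    for n in range(len(signal)):
        acc = f32(0.0)
        for k, c in enumerate(taps):
            if 0 <= n - k < len(signal):
                acc = f32(acc + f32(c * signal[n - k]))
        out.append(acc)
    return out

taps = [float(2 ** (3 * i)) for i in range(8)]       # 1, 8, 64, ..., 2^21
signal = [float(random.randint(1, 7)) for _ in range(32)]

# Exact integer reference for the same convolution.
exact = [float(sum(int(taps[k]) * int(signal[n - k])
                   for k in range(8) if 0 <= n - k < len(signal)))
         for n in range(len(signal))]

# Each product occupies its own 3-bit field, so every intermediate fits
# in the 24-bit binary32 significand and the float32 result is exact.
assert fir_f32(signal, taps) == exact
```

Because the expected values are exact, this test can compare with a delta of zero, sidestepping the tolerance question entirely.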
More broadly, for a filter with t taps, we could set the signal data and the filter coefficients to any integer values of magnitude at most sqrt(2^24/t), and the expected results would always be representable in the binary32 format. You could select randomly or in patterns that are useful to you.
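A small generator for such vectors, as a Python sketch (exact_test_vector is a hypothetical helper; the t and n values are arbitrary):

```python
import math
import random
import struct

def f32(x):
    # Round a Python float to binary32.
    return struct.unpack('f', struct.pack('f', x))[0]

# For t taps, integer data and coefficients of magnitude at most
# sqrt(2^24 / t) keep every product and partial sum within ±2^24,
# so a binary32 FIR over them incurs no rounding at all.
def exact_test_vector(t, n, rng=random):
    bound = math.isqrt(2 ** 24 // t)
    taps = [float(rng.randint(-bound, bound)) for _ in range(t)]
    signal = [float(rng.randint(-bound, bound)) for _ in range(n)]
    return taps, signal

taps, signal = exact_test_vector(t=16, n=64)
# bound here is sqrt(2^24 / 16) = 1024; each |product| <= 1024^2 = 2^20,
# and 16 such terms sum to a magnitude of at most 2^24.
assert all(f32(a * b) == a * b for a in taps for b in signal)
```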