Search code examples
floating-pointprecisionepsilonuncertainty

Float type variable uncertainty


I developed an image processing software and I need to do a numerical analysis of it, considering the error propagation associated to its operations and the uncertainty of float type variables caused by the inherent rounding up that happens with this type of variables.

Considering the IEEE 754 standard the machine epsilon for the float type variables is 1.19e-07. From what I understood, this value is the distance to the nearest representable float.

I did some testing to find if this is true by adding a float value to this epsilon as such: x + epsilon == x. This notion does not hold for every value of the float range, which is understandable since great values of floats have more uncertainty associated with them caused by the rounding and the limited number of bits used to represent them.

My question is what is the uncertainty associated to a float value in such a way that (x + y) || (x - y) == x being the float value x and the float uncertainty y.

It might be my lack of knowledge about the english language but I can not seem to understand the literature about this topic.

If someone could be as detailed as possible can you explain me the error in a simple operation such as the following?

float result = valA * 0.587f + valB * 0.331f;

If I knew the uncertainty of a float type variable this error could be simply calculated with this formulas, right?

enter image description here


Solution

  • Introduction

    This answer presents an initial examination of the error in:

    float result = valA * 0.587f + valB * 0.331f;
    

    In this answer, values in the floating-point format and expressions computed with the floating-point format will be represented with code style, as in z or x * y. Mathematical variables will use italic and will not be in code style, as in z or xy.

    I assume that all arithmetic is done with IEEE-754 basic 32-bit binary floating-point. This format is commonly used for the float type, although some programming language implementations mix precisions, possibly using double or other precision while evaluating expressions of float type. I also assume all arithmetic is done using the round-to-nearest mode, with ties to the number with the even low bit.

    This format as 24 bits in the significand, so the unit of least precision (ULP) is normally 2−23 times the value of the most significant bit. This is the step size between representable values. For example, for values in [1, 2), the ULP is 2−23. For values in [128, 256), the ULP is 27•2−23 = 2−16. (For subnormal values, the significand has fewer bits. The lowest the ULP can be is 2−149. Beyond the largest finite representable value, the step size to the next representable value is infinite. However, in this question, only values of modest value are involved, so we can neglect infinity.)

    The result of computing any operation with correct rounding is at most ½ ULP away from the correct answer. That is, if we compute z = x + y, for example, the computed result z differs from the exact mathematical result z = x + y by at most ½ ULP of z. (Although z is an exact mathematical result with infinite precision, we use its magnitude to determine which range it falls in in the floating-point format, which determines what we mean by the ULP of z.) The reason the error is at most ½ ULP is that, if the two representable values nearest z are z0 and z1, we must have z0zz1, and if ½ ULP < z1z, then zz0 < ½ ULP (because z1z0 = 1 ULP, by definition of an ULP.) Therefore, in choosing the nearest representable value, we would pick the closer of z0 and z1, so the error never exceeds ½ ULP.

    As stated in a comment, valA, valB, and result are in [0, 256).

    Symbolic Analysis

    By the time we start computing valA * 0.587f + valB * 0.331f, valA and valB have some errors from previous operations. That is, ideally, using exact mathematics, we would have computed some numbers A and B, but instead the computer calculated valA and valB, and the differences are eA = valAA and eB = valB - B.

    Ideally, we would like to compute the number R such that R is, using exact mathematics, A • .587 + B • .331. When we use computer arithmetic:

    • 0.587f will be converted from .587 to the floating-point format, and the result will have some rounding error e0, so the result is 0.587f = 0.587 + e0.
    • 0.331f will be converted from .331 to the floating-point format, and the result will have some rounding error e1, so the result is 0.331f = .331 + e1.
    • valA * 0.587f will be computed with some error e2, so the result will be valA * 0.587f = valA0.587f + e2.
    • valB * 0.331f will be computed with some error e3, so the result will be valB * 0.331f = valB0.331f + e3.
    • The two products will be added, with some error e4, so the result will be valA * 0.587f + valB * 0.331f = valA * 0.587f + valB * 0.331f + e4.

    Now we can substitute the expressions, so:

    • valA * 0.587f + valB * 0.331f = (valA0.587f + e2) + (valB0.331f + e3 + e4.
    • valA * 0.587f + valB * 0.331f = (valA • (0.587 + e0) + e2) + (valB • (.331 + e1) + e3) + e4.
    • valA * 0.587f + valB * 0.331f = ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4.

    With this, we have expressed the computed result, valA * 0.587f + valB * 0.331f, as an exact mathematical expression (of variables with incompletely known values), ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4.

    Numerical Analysis

    Next, we can place some bounds on the errors. e0 and e1 are easy, their magnitudes are at most ½ ULP of .587 and .331, respectively. .587 is in [½, 1), so its ULP is 2−24, and .331 is in [¼, ½), so its ULP is 2−25. So |e0| ≤= 2−25, and |e1| ≤= 2−26.

    Bounds on e2 and e3 depend on the magnitudes of valA * 0.587f and valB * 0.331f. Since val < 256, valA * 0.587f < 256, so its ULP is at most 2−16, and |e2| ≤ 2−17. With valB, we can see that valB * 0.331f < 128, so the ULP of valB * 0.331f is at most 2−17, and |e3| ≤ 2−18.

    Finally, we have the error e4 that occurs in the final addition of valA * 0.587f + valB * 0.331f. We have assumed this is less than 256, so its ULP is at most 2−16, and |e4| ≤ 2−17.

    Looking at the mathematical expression of the computed result, ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4, we can see that the largest possible error occurs when e0, e1, e2, e3, and e4 have the greatest values (unless eA or eB is huge and negative, which we assume not to be true). So we can substitute the upper bounds we have prepared for these errors:

    ((A + eA) • (0.587 + 2−25) + 2−17) + ((B + eB) • (.331 + 2−26) + 2−18) + 2−17.

    In the interests of time, I evaluated this with Maple. (It might be a bit more illuminating to expand the expression manually and retain some of the factors rather than consolidating coefficients into single numbers, but I leave that to the reader.) The result is:

    2462056573/4194304000 • A + 2462056573/4194304000 • eA + 5/262144 + 2776629373/8388608000 • B + 2776629373/8388608000 • eB.

    The ideal result is A • .587 + B • .331. When we subtract that from the above, the result is a bound on the error in the computation:

    1/33554432 • A + 2462056573/4194304000 • eA + 5/262144 + 1/67108864 * B + 2776629373/8388608000 • eB.

    Since A < 256 and B < 256, we can substitute 256 for A and for B, yielding:

    1/32768 + 2462056573/4194304000 • eA + 2776629373/8388608000 • eB.

    Reversing a bit of Maple’s arithmetic, that is:

    2−15 + (.587 + 2−25) • eA + (.331 + 2−26) • eB.

    So, that is an upper bound on the error in valA * 0.587f + valB * 0.331f. It could possibly be reduced more with additional information about the relationship between valA and valB. Also, the errors in converting .587 and .331 to float are exactly known, so those should be used instead of the bounds I used as illustration in this answer.

    One also needs to establish a lower bound on the error. The rounding errors could be negative, and we have to ask what the lowest possible value of ((A + eA) • (0.587 + e0) + e2) + ((B + eB) • (.331 + e1) + e3) + e4 is. As I am out of time for now, that is left for the reader.

    Addendum

    e0 is 13/1048576000. e1 is 1/4194304000. Then the upper bound on the error can be reduced to 731/32768000 + 4924113/8388608 • eA + 11106517/33554432 • eB, which is:

    .731•2−15 + (.587 + .013•2−20) • eA + (.331 + .001•2−22) • eB.