math, floating-point, ieee-754

Does a computer round the numbers in an operation first, or round the result?


For instance, in the operation 9.4 - 9.0 - 0.4: does the computer first round each number and store it, or does it carry out the computation with the help of some extra bits (this example is in double-precision format) and then round the result? These are the stored values, but I wasn't sure how to do this operation by hand to check whether it rounds each number first or not.

binary( 9.4) = 0 10000000010 0010110011001100110011001100110011001100110011001101

binary(-9.0) = 1 10000000010 0010000000000000000000000000000000000000000000000000

binary(-0.4) = 1 01111111101 1001100110011001100110011001100110011001100110011010

binary(9.4 - 9.0 - 0.4) = 0 01111001011 1000000000000000000000000000000000000000000000000000


Solution

  • Generally, the computer will convert the numerals in 9.4 - 9.0 - 0.4 to numbers in an internal form, and then it will perform the arithmetic operations. These conversions generally round their results.

    Consider the text in source code 9.4 - 9.0 - 0.4. Nothing in there is a number. That text is a string composed of characters. It contains the characters “9”, “.”, “4”, “ ”, “-”, and so on. Generally, a computer converts this text to other forms for processing. You could write software that works with numbers in a text format, but this is rare. Generally, when we are using a programming language, either compiled or interpreted, the numerals in this text will be converted to some internal form. (A “numeral” is a sequence of symbols representing a number. So “9.4” is a numeral representing 9.4.)
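
    The step from characters to an internal form can be sketched as follows. This is only an illustration that uses Python's own tokenizer as a stand-in for whatever a given language implementation does; the helper name numerals_in is hypothetical.

```python
import io
import tokenize

source_text = "9.4 - 9.0 - 0.4"

def numerals_in(text):
    """Return the numeral substrings that Python's tokenizer finds in text."""
    tokens = tokenize.generate_tokens(io.StringIO(text).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.NUMBER]

# At this point the numerals are still strings of characters.
numerals = numerals_in(source_text)

# Converting each numeral to the internal floating-point form is a
# separate step, and it is where the first rounding can happen.
values = [float(n) for n in numerals]

print(numerals)
print(values)
```

    The list of strings and the list of floats are different objects with different types; only the second holds numbers the arithmetic hardware can work with.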

    IEEE-754 binary64 is a very common floating-point format. In this format, each representable number is expressed in units of some power of two. For example, the numbers .125, .250, .375, and .500 are representable because they are multiples of 1/8, which is 2^-3. However, 9.4 is not an integer multiple of any power of two, so it cannot be represented in IEEE-754 binary64.
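
    One way to check representability yourself, assuming a Python implementation whose float type is binary64 (true of CPython on common platforms), is to compare the exact rational value of a numeral with the exact rational value of the float it converts to. The helper name is_exactly_representable is made up for this sketch.

```python
from fractions import Fraction

def is_exactly_representable(numeral):
    """True if converting the decimal numeral to a float loses nothing."""
    exact = Fraction(numeral)          # exact value of the decimal numeral
    stored = Fraction(float(numeral))  # exact value of the float that gets stored
    return exact == stored

for numeral in ["0.125", "0.250", "0.375", "0.500", "9.0", "9.4", "0.4"]:
    print(numeral, is_exactly_representable(numeral))
```

    The multiples of 1/8 and the integer 9.0 pass the check, while 9.4 and 0.4 do not.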

    When 9.4 is converted to binary64, the nearest representable value is 9.4000000000000003552713678800500929355621337890625. (This is a multiple of 2^-49, which is the power of two used when representing numbers near 9.4, specifically numbers from 8 [inclusive] to 16 [exclusive].)

    9 is representable in binary64, so 9 is converted to 9.

    0.4 is not representable in binary64. When 0.4 is converted to binary64, the nearest representable value is 0.40000000000000002220446049250313080847263336181640625. This is a multiple of 2^-54, which is the power of two used for numbers from ¼ to ½.
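
    The converted values can be inspected directly. The sketch below assumes Python 3.9 or later (for math.ulp) and a binary64 float type; Decimal(x) applied to a float shows the exact stored value with no further rounding, and math.ulp(x) reports the spacing between representable numbers at x.

```python
from decimal import Decimal
import math

for numeral in ["9.4", "9.0", "0.4"]:
    stored = float(numeral)        # conversion from text, possibly rounded
    exact_text = Decimal(stored)   # exact decimal expansion of the stored value
    spacing = math.ulp(stored)     # distance to the next representable number
    print(numeral, "->", exact_text, "| spacing:", spacing)
```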

    In 9.4 - 9.0 - 0.4, the result of the first subtraction is 0.4000000000000003552713678800500929355621337890625. This is exactly representable, so there is no rounding at this point. Then, when 0.4 is subtracted, after it has been converted to the value above, the result is 0.00000000000000033306690738754696212708950042724609375. This is also exactly representable, so there is again no rounding at this point.
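
    Whether a particular subtraction rounded can be checked by redoing it in exact rational arithmetic on the already-converted operands. This is a small sketch, again assuming binary64 floats; the variable names are arbitrary.

```python
from fractions import Fraction

# The literals are converted to binary64 before any subtraction happens.
a, b, c = 9.4, 9.0, 0.4

first = a - b
# Compare the float result with the exact difference of the stored operands.
first_is_exact = Fraction(first) == Fraction(a) - Fraction(b)

second = first - c
second_is_exact = Fraction(second) == Fraction(first) - Fraction(c)

print(first, first_is_exact)
print(second, second_is_exact)
```

    Both flags come out true, so all of the rounding in this example took place when the numerals were converted, not in the subtractions.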

    The above describes what happens if binary64 is used throughout. Many programming languages, or specific implementations of them, use binary64. Some may use other formats. Some languages permit implementations to use a mix of formats—they may use a wider format than binary64 for doing calculations and convert to binary64 for the final result. This can cause you to see different results than the above.
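
    To see how much the choice of format matters, here is a rough emulation of the same expression in IEEE-754 binary32. It illustrates a narrower format rather than the wider ones mentioned above, simply because that is easy to emulate with the standard library; the helpers to_binary32 and sub32 are hypothetical names for this sketch.

```python
import struct

def to_binary32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sub32(x, y):
    """Subtract two binary32 values, rounding the result to binary32."""
    return to_binary32(x - y)

# Convert each numeral to binary32, then do both subtractions in binary32.
a32, b32, c32 = to_binary32(9.4), to_binary32(9.0), to_binary32(0.4)
result32 = sub32(sub32(a32, b32), c32)

# The same expression with binary64 used throughout.
result64 = 9.4 - 9.0 - 0.4

print(result32)
print(result64)
```

    The two results differ even in sign, because 9.4 rounds downward when converted to binary32 but upward when converted to binary64.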

    So the answer to your question is that, with floating-point arithmetic, each operation produces a result that is equal to the number you would get by computing the exact real-number result and then rounding that result to a value representable in the floating-point format. (Rounding is most often done by rounding to the nearest representable value, with ties resolved by one of several methods, but other rounding choices are possible, such as rounding down.)
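
    A case where the operation itself does round makes this rule visible. The operands 0.1 and 0.2 are chosen only for illustration and are not part of the question. The sketch assumes CPython with binary64 floats and round-to-nearest, ties-to-even, where converting a Fraction to float is correctly rounded.

```python
from fractions import Fraction

x, y = 0.1, 0.2                         # binary64 operands after conversion

exact_sum = Fraction(x) + Fraction(y)   # exact real-number sum of the operands
computed = x + y                        # what the floating-point addition returns

# The exact sum is not representable, so the addition had to round ...
rounding_occurred = Fraction(computed) != exact_sum

# ... and the value it returned is the exact sum rounded to binary64.
matches_rounded_exact = computed == float(exact_sum)

print(rounding_occurred, matches_rounded_exact)
```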

    The operations generally do not round their operands. (There are exceptions, such as that some processors may convert subnormal inputs to zero.) However, those operands must be produced first, such as by converting source text to a representable number. Those conversions are separate operations from the subtraction or other operations that follow.
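
    The two possibilities raised in the question can be compared side by side. The "round only at the end" machine below is hypothetical: it is emulated with exact rational arithmetic on the decimal numerals, followed by a single conversion to float. Binary64 floats are assumed.

```python
from fractions import Fraction

numerals = ["9.4", "9.0", "0.4"]

# Hypothetical machine: exact arithmetic on the numerals, one rounding at the end.
exact = Fraction(numerals[0]) - Fraction(numerals[1]) - Fraction(numerals[2])
round_only_at_end = float(exact)

# Usual behaviour: convert (and round) each numeral first, then subtract.
a, b, c = (float(n) for n in numerals)
convert_then_operate = a - b - c

print(round_only_at_end)
print(convert_then_operate)
```

    The first value is exactly zero, while the second is the small positive number described above, which matches the behaviour of converting each numeral before operating on it.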