Search code examples
floating-pointcomputer-scienceprecisioncpu-architecture

Is it possible to omit rounding of intermediate results during arithmetic operation on multiple FP operands?


Is there the possibility to do an arithmetic operation on multiple floating point operands without rounding intermediate results and just round the final result, and are there any architectures currently doing it? because as far as I've seen after two floating point operands are used in an addition/subtraction operation the result gets rounded before being used as an operand for another operation, also I've seen this.

EDIT:

Below are some examples considering single-precision format to elucidate the concept, 3 least significant bits of intermediate 27-bit mantissas taking part in arithmetic operation are guard, round and sticky bits; from examples You can see, using the intermediate mantissa bit structure utilized in a IEEE754 compliant FP system, avoiding rounding of intermediate values is possible and when it's done it'll achieve a more accurate result:

1_Example 1

A-B = 101001000010001110110100 1 0 1×2^EXP

C   = 100011001011001010010100×2^(EXP-4) --> 
C   = 000010001100101100101001 0 1 0×2^EXP

1_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B   = 101001000010001110110101×2^EXP
        A-B-C = 100110110101100010001011 1 1 0×2^EXP
Rounded A-B-C = 100110110101100010001100×2^EXP

1_2_If A-B-C is calculated without A-B being rounded:

        A-B-C = 100110110101100010001011 0 1 1×2^EXP
Rounded A-B-C = 100110110101100010001011×2^EXP

2_Example 2

A-B = 100001001100101011100000 1 0 1×2^EXP
C   = 101001011010001110001000×2^(EXP-5) --> 
C   = 000001010010110100011100 0 1 0×2^EXP

2_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B   = 100001001100101011100001×2^EXP
A-B-C         = 011111111001110111000100 1 1 0×2^EXP
A-B-C shifted 1 bit to left 
              = 111111110011101110001001 1 0 0×2^(EXP-1)
Rounded A-B-C = 11111111001110111000101×2^(EXP-1)

2_2_If A-B-C is calculated without A-B being rounded:

A-B-C         = 011111111001110111000100 0 1 1×2^EXP
A-B-C shifted 1 bit to left 
              = 111111110011101110001000 1 1 0×2^(EXP-1)
Rounded A-B-C = 111111110011101110001001×2^(EXP-1)

3_Example 3

A-B = 10000001101000110101001 1 1 1×2^EXP

C   = 100010100101011010010000×2^(EXP-6) -->
C   = 000000100010100101011010 0 1 0×2^EXP

3_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 10000001101000110101010×2^EXP
   A-B-C    = 01111101010100001001111 1 1 0×2^EXP
A-B-C shifted 1 bit to left 
            = 11111010101000010011111 1 0 0×2^(EXP-1)
Rounded A-B-C=11111010101000010100000×2^(EXP-1)

3_2_If A-B-C is calculated without A-B being rounded:

A-B-C       = 01111101010100001001111 1 0 1×2^EXP
A-B-C shifted 1 bit to left
            = 11111010101000010011111 0 1 0×2^(EXP-1)
Rounded A-B-C=11111010101000010011111×2^(EXP-1)

4_Example 4

A-B = 101100101000111000110101 0 1 1×2^EXP

C = 100110010110011101100000×2^(EXP-7) --> 
C = 000000010011001011001110 1 1 0×2^EXP

4_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 101100101000111000110110×2^EXP
A-B-C       = 101100010101101101100111 0 1 0×2^EXP
Rounded A-B-C=101100010101101101100111×2^EXP

4_2_If A-B-C is calculated without A-B being rounded:

A-B-C         = 101100010101101101100111 1 0 1×2^EXP
Rounded A-B-C = 101100010101101101101000×2^EXP

5_Example 5

A-B = 100000111011001111001010 0 1 1×2^EXP

C = 110001011010010110010110×2^(EXP-3) --> 
C = 000110001011010010110010 1 1 0×2^EXP

5_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 100000111011001111001010×2^EXP
A-B-C       = 011010101111111100010111 0 1 0×2^EXP
A-B-C shifted 1 bit to left
            = 110101011111111000101110 1 0 0×2^(EXP-1)
Rounded A-B-C=110101011111111000101110×2^(EXP-1)

5_2_If A-B-C is calculated without A-B being rounded:

A-B-C       = 011010101111111100010111 1 0 1×2^EXP
A-B-C shifted 1 bit to left 
            = 110101011111111000101111 0 1 0×2^(EXP-1)
Rounded A-B-C=110101011111111000101111×2^(EXP-1)

6_Example 6

A-B = 100000000011000111001010 0 0 1×2^EXP
C = 110010001100110111000000×2^(EXP-8) -->
C = 000000001100100011001101 1 1 0×2^EXP

6_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 100000000011000111001010×2^EXP
A-B-C       = 011111110110100011111100 0 1 0×2^EXP
A-B-C shifted 1 bit to left
            = 111111101101000111111000 1 0 0×2^(EXP-1)
Rounded A-B-C=111111101101000111111000×2^(EXP-1)

6_2_If A-B-C is calculated without A-B being rounded:

A-B-C       = 011111110110100011111100 0 1 1×2^EXP
A-B-C shifted 1 bit to left
            = 111111101101000111111000 1 1 0×2^(EXP-1)
Rounded A-B-C=111111101101000111111001×2^(EXP-1)

Solution

  • According to the paper referenced in the question, it is possible to calculate the dot product of a pair of length N vectors with a single rounding operation at the end, getting the closest representable result to the dot product.

    In practice, current computers round intermediate results, which often results in an answer that is not the closest representable. At best, the rounding is to an extended format, which will tend to reduce, but not eliminate, intermediate result rounding error.

    With a fused multiply-add the rounding is done N times, once for each multiply-add. Without a fused multiply-add it is done twice for each pair, once after the multiply and again after the add.

    The BibTex information for the paper is:

    @INPROCEEDINGS{ yao:correctly,
    AUTHOR = "Tao Yao and Deyuan Gao and Xiaoya Fan and Jari Nurmi",
    TITLE = "Correctly rounded architectures for Floating-Point multi-operand addition and dot-product computation",
    BOOKTITLE = "ASAP'13",
    PAGES = {346-355},
    YEAR = {2013}, 
    }