
Extended precision floating point dangers in C#


I am writing a library for multiprecision arithmetic based on a paper I am reading. It is very important that I can guarantee the properties of the floating-point numbers I use, in particular that they adhere to the IEEE 754 standard for double-precision floating-point numbers. Clearly I cannot guarantee the behavior of my code on an unexpected platform, but for the x86 and x64 processors I am targeting, I am concerned about one particular hazard. Apparently, some or all x86 / x64 processors may hold floating-point values in their FPU registers in an 80-bit extended-precision format. I cannot tolerate my arithmetic being carried out in extended precision without being rounded to double precision after every operation, because the proofs of correctness for the algorithms I am using rely on that rounding. I can easily identify cases in which extended precision would break these algorithms.

I am writing my code in C#. How can I guarantee that certain values are rounded? In C, I would declare variables as volatile, forcing them to be written back to RAM. This is slow, and I would rather keep the numbers in registers as 64-bit floats, but correctness, not speed, is the whole point of these algorithms. In any case, I need a solution for C#. If this seems infeasible, I will approach the problem in a different language.
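
To make the hazard concrete, here is a minimal sketch of the kind of routine whose correctness proof depends on per-operation rounding. It uses Knuth's TwoSum error-free transformation, a common building block in multiprecision libraries. This is an assumption chosen for illustration: the paper mentioned above is not named, so this is not necessarily one of its algorithms. The names TwoSumDemo and TwoSum are hypothetical, and the tuple syntax assumes C# 7 or later.

```csharp
using System;

// Hypothetical illustration; not taken from the paper mentioned in the question.
public static class TwoSumDemo
{
    // Knuth's TwoSum: returns Sum = fl(a + b) and Error such that
    // a + b == Sum + Error exactly. The proof of that identity assumes
    // every intermediate result below is rounded to IEEE 754 double
    // precision. If some of them are silently kept in an 80-bit
    // register, Error may no longer be the true rounding error.
    public static (double Sum, double Error) TwoSum(double a, double b)
    {
        double s = a + b;                // rounded sum
        double bVirtual = s - a;         // the part of b that made it into s
        double aVirtual = s - bVirtual;  // the part of a that made it into s
        double bRoundoff = b - bVirtual; // what was lost from b
        double aRoundoff = a - aVirtual; // what was lost from a
        return (s, aRoundoff + bRoundoff);
    }

    public static void Main()
    {
        // 1e-16 is below half an ulp of 1.0 (2^-53, about 1.11e-16), so
        // under strict double rounding the sum is exactly 1.0 and the
        // whole of b is recovered in the error term.
        var (sum, error) = TwoSum(1.0, 1e-16);
        Console.WriteLine($"sum = {sum}, error = {error}");
    }
}
```

When every operation is rounded to double, the call above gives a sum of exactly 1.0 and an error term of exactly 1e-16. If the intermediate values were instead held in extended precision, the subtractions would see a sum that already contains most of b, and the recovered error term would no longer match the rounding error of the double result.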


Solution

  • The C# spec has this to say on the topic:

    Only at excessive cost in performance can such hardware architectures be made to perform floating-point operations with less precision, and rather than require an implementation to forfeit both performance and precision, C# allows a higher precision type to be used for all floating-point operations. Other than delivering more precise results, this rarely has any measurable effects.

    As a result, third-party libraries are required to simulate the behavior of an IEEE 754-compliant FPU. One such library is SoftFloat, which defines a SoftFloat type that uses operator overloads to simulate standard double behavior.