Tags: floating-point, cpu-architecture, alu

Why are denormal floating-point values slower to handle?


It is generally the case that floating-point operations that consume or produce denormal values are slower than those working entirely on normal values, sometimes much slower.

Why is this the case? If it is because they trap to software instead of being handled directly in hardware, as is reportedly the case on some CPUs, why do they have to do that?


Solution

  • With IEEE-754 floating-point arithmetic, most operands encountered are normalized floating-point numbers, and the internal data paths in processors are built for normalized operands. Additional exponent bits may be used in internal representations to keep floating-point operands normalized inside the data path at all times.

    Any subnormal input therefore requires additional work: first determine the number of leading zeros, then left-shift the significand to normalize it while adjusting the exponent accordingly. A subnormal result requires right-shifting the significand by the appropriate amount, and rounding may need to be deferred until after that has happened. The sketch below illustrates the input case.
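
    To make that concrete, here is a minimal C sketch (values and variable names are purely illustrative, assuming IEEE-754 binary64 and a hypothetical wider internal exponent) of the count-and-shift normalization that must be performed on a subnormal operand:

    ```c
    /* Minimal sketch of normalizing a subnormal binary64 input, assuming a
     * hypothetical wider internal exponent that can go below the
     * architectural minimum of -1022. Illustration only. */
    #include <float.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        double x = DBL_MIN / 512;                  /* 2^-1031, a subnormal value */
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);

        uint64_t sig = bits & 0xFFFFFFFFFFFFFull;  /* 52-bit significand field */
        int biased_exp = (int)((bits >> 52) & 0x7FF);

        if (biased_exp == 0 && sig != 0) {         /* subnormal encoding */
            /* Left-shift until the leading 1 reaches the position of the
             * implicit integer bit (bit 52 of the 53-bit significand),
             * decrementing the internal exponent for every shift. */
            int internal_exp = -1022;              /* exponent of all subnormals */
            while (!(sig & (1ull << 52))) {
                sig <<= 1;
                internal_exp--;
            }
            /* internal_exp is now below -1022, which is why the internal
             * representation needs extra exponent bits. */
            printf("normalized significand 0x%014llx * 2^%d\n",
                   (unsigned long long)sig, internal_exp);
        }
        return 0;
    }
    ```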

    If handled purely in hardware, this additional work typically requires additional logic and additional pipeline stages: one, maybe even two, extra clock cycles each for handling subnormal inputs and subnormal outputs. But the performance of typical CPUs is sensitive to the latency of instructions, and significant effort is expended to keep latencies low. The latency of an FADD, FMUL, or FMA instruction is typically between 3 and 6 cycles, depending on implementation and frequency targets.

    Adding, say, 50% additional latency for the potential handling of subnormal operands is therefore unattractive, all the more so because subnormal operands are rare in most use cases. Following the design philosophy of "make the common case fast, and the uncommon case functional", there is a significant incentive to push the handling of subnormal operands out of the "fast path" (pure hardware) into a "slow path" (a combination of existing hardware plus software).

    I have participated in the design of floating-point units for x86 processors, and the common approach is to invoke an internal microcode-level exception when subnormals need to be handled. This subnormal handling may take on the order of 100 clock cycles. The most expensive part of that is typically not the execution of the fix-up code itself, but getting into and out of the microcode exception handler.
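
    The cost of such an assist is easy to observe from user code. The following self-contained C micro-benchmark is a rough sketch (the iteration count and the constant 1e-310 are arbitrary illustrative choices): it times the same dependent multiply-add loop once with normal operands and once with operands held in the subnormal range. On hardware that takes a microcode assist per subnormal operation, the second timing is typically many times larger:

    ```c
    /* Rough micro-benchmark of the subnormal penalty.
     * Build without fast-math options, e.g.  cc -O2 subnormal_bench.c  */
    #include <stdio.h>
    #include <time.h>

    static double run(double start, long iters) {
        volatile double x = start;       /* volatile keeps the loop from being optimized away */
        for (long i = 0; i < iters; i++)
            x = x * 0.5 + start;         /* settles near 2*start in magnitude */
        return x;
    }

    int main(void) {
        const long N = 20 * 1000 * 1000;

        clock_t t0 = clock();
        run(1.0, N);                     /* normal inputs and results */
        clock_t t1 = clock();
        run(1e-310, N);                  /* subnormal inputs and results (below DBL_MIN) */
        clock_t t2 = clock();

        printf("normal   : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("subnormal: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }
    ```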

    I am aware of specific use cases, for example particular filters in digital signal processing, where encountering subnormals is common. To support such applications at speed, many floating-point units support a non-standard flush-to-zero mode in which subnormal encodings are treated as zero.
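
    On x86, for example, these modes appear as the FTZ (flush-to-zero) and DAZ (denormals-are-zero) bits of the MXCSR register, which compilers expose through SSE intrinsics. The following sketch (a minimal illustration, assuming an x86/x86-64 target with SSE3 headers) enables both and shows a subnormal product being replaced by zero:

    ```c
    /* Sketch: enabling the non-standard x86 FTZ and DAZ modes. With these
     * bits set, subnormal results and inputs are silently treated as zero. */
    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ) */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ) */
    #include <stdio.h>

    int main(void) {
        volatile double tiny = 1e-310;            /* subnormal input */

        printf("default : %g\n", tiny * 0.5);     /* subnormal result, slow path */

        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          /* flush subnormal results to 0 */
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  /* treat subnormal inputs as 0 */

        printf("FTZ/DAZ : %g\n", tiny * 0.5);     /* now prints 0 */
        return 0;
    }
    ```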

    Note that there are throughput-oriented processor designs with significant latency tolerance, in particular GPUs. I am familiar with NVIDIA GPUs, and as best I can tell they handle subnormal operands without additional overhead, and have done so for the past dozen years or so. Presumably this comes at the cost of additional pipeline stages, but the vendor does not document many of the microarchitectural details of these processors, so it is hard to know for sure. The following paper may provide some general insights into how different hardware designs handle subnormal operands, some with very little overhead:

    E. M. Schwarz, M. Schmookler, and S. D. Trong, "FPU implementations with denormalized numbers," IEEE Transactions on Computers, Vol. 54, No. 7, July 2005, pp. 825-836.