When writing computational code for GPUs using APIs where compute shaders are translated via SPIR-V (in particular, Vulkan), I am guaranteed that the ULP error of floating-point division will be at most 3. Other basic arithmetic (addition, multiplication) is correctly rounded.
How can one efficiently implement correctly rounded division under these circumstances? Let's assume that FMA is available and correctly rounded.
What happens with denormals depends on the underlying hardware. The Vulkan API allows querying whether the device can preserve denormals and whether it can flush them to zero (so a GPU that does not fully support denormals will report "canPreserve: false, canFlush: true"). Let's additionally assume that the GPU can produce and handle denormals without flushing them to zero (otherwise trying to produce a correctly rounded subnormal result seems futile).
Modern FMA-based division algorithms typically rely heavily on the work of Peter Markstein who worked on this issue extensively. The canonical reference is:
Peter Markstein, "IA-64 and Elementary Functions: Speed and Precision", Prentice-Hall 2000.
The C code below is also based on Markstein's work. It basically consists of a fast path that handles common cases as quickly as possible, and a slow path that handles all the pesky special cases.
The concepts involved are fairly straightforward, but they require careful mathematical analysis to ensure a correctly rounded result. The first step is to compute an accurate reciprocal of the divisor. This requires an accurate computation of the approximation error, for which FMA is the best tool. The refinement of the reciprocal from a (hardware-based) approximation typically takes the form of a single Newton-Raphson iteration with quadratic convergence or a single Halley iteration with cubic convergence, both of which map perfectly onto FMA.
Multiplying the reciprocal of the divisor with the dividend then gives us an approximation to the quotient. This is likewise refined based on the FMA-based computation of the residual. At this stage one typically works with a version of the dividend scaled towards unity to guard against overflow and underflow in intermediate computation and allow for the widest possible range of operands in the fast path of the division algorithm. This also means that there is a need for a final scaling step at the end to adjust the quotient magnitude.
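The scaling toward unity mentioned above can be sketched as follows (helper names are mine, for illustration only): the dividend's significand bits are combined with an exponent of zero, while the sign and exponent bits alone form the scale factor, so that for any normal dividend the product of the two parts reconstitutes the original value.

```c
#include <stdint.h>
#include <string.h>

static float u32_as_float (uint32_t a) { float r; memcpy (&r, &a, sizeof r); return r; }
static uint32_t float_as_u32 (float a) { uint32_t r; memcpy (&r, &a, sizeof r); return r; }

/* Split a normal float into a significand in [1,2) and a sign/exponent
   scale factor, such that dividend == *mant * *scale. */
static void split_dividend (float dividend, float *mant, float *scale)
{
    *mant  = u32_as_float ((float_as_u32 (dividend) & 0x007fffffu) | 0x3f800000u);
    *scale = u32_as_float (float_as_u32 (dividend) & 0xff800000u);
}
```

For example, 6.0 splits into a significand of 1.5 and a scale factor of 4.0; the final multiplication by the scale factor restores the quotient's magnitude and sign.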
An advantageous arrangement of the two paths through the code is to first unconditionally perform the fast-path computation, then check at the end whether the conditions for fast-path usage are met, and, if not, invoke the slow-path code. This allows the necessary conditions to be computed concurrently with the fast-path computation, and allows the slow-path computation to be outlined, which facilitates placing that code in a rarely used memory page where various slow-path and exception-handling code is collected.
Please treat the code below as an algorithmic sketch. The included "smoke"-level test based on random test vectors is far from a rigorous test framework. The conditions for invoking the slow path have been shown neither to be as tight as possible, nor as tight as necessary, nor arranged in the most efficient way. The refinement of the reciprocal uses a single Newton-Raphson step, which may not be sufficient for a given GPU's reciprocal approximation; a Halley iteration may need to be substituted at the cost of an additional FMA. Finally, I have omitted the construction of the entire slow path, as that would exceed the limitations of a Stack Overflow answer, both in terms of the design effort and the amount of textual description.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float rcp_approx (float a)
{
return 1.0f / a; // fake it for now
}
float fp32_div_slowpath (float dividend, float divisor)
{
return dividend / divisor; // fake it for now
}
float fp32_div (float dividend, float divisor)
{
const uint32_t FP32_MANT_MASK = 0x007fffffu;
const uint32_t FP32_ONE = 0x3f800000u;
const uint32_t FP32_SIGN_MASK = 0x80000000u;
const uint32_t FP32_SIGN_EXPO_MASK = 0xff800000u;
const uint32_t FP32_HI_LIMIT = 0x7e800000u; // 0x1.0p+126
const uint32_t FP32_LO_LIMIT = 0x00800001u; // 0x1.0p-126 + ulp
// fast path
float recip_approx = rcp_approx (divisor);
float recip_err = fmaf (recip_approx, -divisor, 1.0f);
float dividend_mant = uint32_as_float ((float_as_uint32 (dividend) & FP32_MANT_MASK) | FP32_ONE);
float dividend_scale = uint32_as_float (float_as_uint32 (dividend) & FP32_SIGN_EXPO_MASK);
float refined_recip = fmaf (recip_approx, recip_err, recip_approx);
float quotient_mant = refined_recip * dividend_mant;
float quotient_residual = fmaf (quotient_mant, -divisor, dividend_mant);
float refined_quotient_mant = fmaf (refined_recip, quotient_residual, quotient_mant);
float refined_quotient_residual = fmaf (refined_quotient_mant, -divisor, dividend_mant);
float final_quotient_mant = fmaf (refined_recip, refined_quotient_residual, refined_quotient_mant);
float final_quotient = final_quotient_mant * dividend_scale;
// check if we need to apply the slow path and invoke it if that is the case
uint32_t id = float_as_uint32 (divisor) & ~FP32_SIGN_MASK;
uint32_t iq = float_as_uint32 (final_quotient) & ~FP32_SIGN_MASK;
if ((id > FP32_HI_LIMIT) || ((iq - FP32_LO_LIMIT) > (FP32_HI_LIMIT - FP32_LO_LIMIT))) {
final_quotient = fp32_div_slowpath (dividend, divisor);
}
return final_quotient;
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
uint64_t count = 0;
float divisor, dividend, res, ref;
do {
if (!(count & 0xffffffULL)) printf ("\rcount: %llu", (unsigned long long)count);
dividend = uint32_as_float (KISS);
divisor = uint32_as_float (KISS);
res = fp32_div (dividend, divisor);
ref = dividend / divisor;
if (float_as_uint32 (res) != float_as_uint32 (ref)) {
printf ("error @ dividend=% 16.6a divisor=% 16.6a (% 15.8e, % 15.8e): res=% 16.6a ref=% 16.6a (% 15.8e, % 15.8e)\n",
dividend, divisor, dividend, divisor, res, ref, res, ref);
return EXIT_FAILURE;
}
count++;
} while (count);
return EXIT_SUCCESS;
}
Some additional notes after re-acquainting myself with the issue: It is advantageous if the hardware's reciprocal approximation instruction treats subnormals as zero. This allows the elimination of the divisor magnitude check prior to invoking the slow-path code, as any divisor greater than 2^126 will result in a reciprocal of zero and thus a zero quotient, triggering the quotient check.
As for the quotient check, any fast-path result that represents a normal can be accepted, except for the smallest normal, which may represent an incorrect rounding of a subnormal result. In other words, quotients whose absolute value is in [0x1.000002p-126, 0x1.fffffep+127] should not trigger re-computation by the slow-path code.
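The acceptance test for that interval reduces to a single unsigned comparison, relying on wraparound of the subtraction for bit patterns below the lower limit; a minimal sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the fast-path quotient must be recomputed by the slow
   path, i.e. if its magnitude lies outside [0x1.000002p-126, 0x1.fffffep+127].
   For iq < LO the subtraction wraps around to a huge value, so zeros,
   subnormals, the smallest normal, infinities, and NaNs all trigger. */
static bool needs_slowpath (uint32_t quotient_bits)
{
    const uint32_t LO = 0x00800001u; /* 0x1.0p-126 + 1 ulp */
    const uint32_t HI = 0x7f7fffffu; /* largest finite float */
    uint32_t iq = quotient_bits & 0x7fffffffu; /* clear the sign bit */
    return (iq - LO) > (HI - LO);
}
```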
There is no need for a quotient-refinement step as long as a sufficiently accurate reciprocal is used. An exhaustive test that covers all dividends in [1,2] and all divisors in [1,2], and thus all possible combinations of significand (mantissa) patterns, is feasible with modern hardware, requiring 2^46 test cases. Even with scalar code running on a single CPU core it completed in less than four days, with no errors reported.
In practical usage, where possible, one would want to force the compiler to inline fp32_div() while forcing fp32_div_slowpath() into a called subroutine.
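With GCC or Clang, for example, this can be expressed with function attributes (a sketch only; MSVC spells these __forceinline and __declspec(noinline)):

```c
/* Keep the rarely taken special-case handling out of line ... */
__attribute__((noinline))
static float fp32_div_slowpath (float dividend, float divisor)
{
    return dividend / divisor; /* stand-in for the full special-case code */
}

/* ... while forcing the fast path into every call site. */
__attribute__((always_inline))
static inline float fp32_div (float dividend, float divisor)
{
    /* fast-path computation as shown above, falling back when needed */
    return fp32_div_slowpath (dividend, divisor);
}
```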
Below I prototyped an AVX-based version of single-precision division using the simplifications discussed above. It passed all my tests. Since the accuracy of the AVX-based hardware reciprocal is low, a Halley iteration is required to refine the reciprocal to full single-precision, delivering results with a maximum error close to 0.5 ulp.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "immintrin.h"
#define MODE_RANDOM (1)
#define MODE_MANT_EXHAUSTIVE (2)
#define TEST_MODE (MODE_RANDOM)
float uint32_as_float (uint32_t a) { float r; memcpy (&r, &a, sizeof r); return r; }
uint32_t float_as_uint32 (float a) { uint32_t r; memcpy (&r, &a, sizeof r); return r; }
float fp32_div_slowpath (float dividend, float divisor) { return dividend / divisor; }
/* 12-bit approximation. Subnormal arguments and results are treated as zero */
float rcp_approx_avx (float divisor)
{
float r;
__m256 t = _mm256_set1_ps (divisor);
t = _mm256_rcp_ps (t);
memcpy (&r, &t, sizeof r); // extract the lowest lane
return r;
}
float fp32_div (float dividend, float divisor)
{
const uint32_t FP32_MANT_MASK = 0x007fffffu;
const uint32_t FP32_ONE = 0x3f800000u;
const uint32_t FP32_SIGN_MASK = 0x80000000u;
const uint32_t FP32_SIGN_EXPO_MASK = 0xff800000u;
const uint32_t FP32_QUOTIENT_HI_LIMIT = 0x7f7fffffu; // 0x1.0p+128 - ulp
const uint32_t FP32_QUOTIENT_LO_LIMIT = 0x00800001u; // 0x1.0p-126 + ulp
// fast path
float recip_approx = rcp_approx_avx (divisor);
float recip_err = fmaf (recip_approx, -divisor, 1.0f);
float recip_err2 = fmaf (recip_err, recip_err, recip_err);
float refined_recip = fmaf (recip_approx, recip_err2, recip_approx);
float dividend_mant = uint32_as_float ((float_as_uint32 (dividend) & FP32_MANT_MASK) | FP32_ONE);
float dividend_scale = uint32_as_float (float_as_uint32 (dividend) & FP32_SIGN_EXPO_MASK);
float refined_quotient_mant = refined_recip * dividend_mant;
float refined_quotient_residual = fmaf (refined_quotient_mant, -divisor, dividend_mant);
float final_quotient_mant = fmaf (refined_recip, refined_quotient_residual, refined_quotient_mant);
float final_quotient = final_quotient_mant * dividend_scale;
// check if we need to apply the slow path and invoke it if that is the case
uint32_t iq = float_as_uint32 (final_quotient) & ~FP32_SIGN_MASK;
if ((iq - FP32_QUOTIENT_LO_LIMIT) > (FP32_QUOTIENT_HI_LIMIT - FP32_QUOTIENT_LO_LIMIT)) {
final_quotient = fp32_div_slowpath (dividend, divisor);
}
return final_quotient;
}
// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static unsigned int z=362436069,w=521288629,jsr=362436069,jcong=123456789;
#define znew (z=36969*(z&0xffff)+(z>>16))
#define wnew (w=18000*(w&0xffff)+(w>>16))
#define MWC ((znew<<16)+wnew)
#define SHR3 (jsr^=(jsr<<13),jsr^=(jsr>>17),jsr^=(jsr<<5)) /* 2^32-1 */
#define CONG (jcong=69069*jcong+13579) /* 2^32 */
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
uint64_t count = 0;
float divisor, dividend, res, ref;
#if TEST_MODE == MODE_RANDOM
do {
if (!(count & 0xffffffULL)) printf ("\rcount: %llu", (unsigned long long)count);
dividend = uint32_as_float (KISS);
divisor = uint32_as_float (KISS);
res = fp32_div (dividend, divisor);
ref = dividend / divisor;
if (float_as_uint32 (res) != float_as_uint32 (ref)) {
printf ("error @ dividend=% 16.6a divisor=% 16.6a (% 15.8e, % 15.8e): res=% 16.6a ref=% 16.6a (% 15.8e, % 15.8e)\n",
dividend, divisor, dividend, divisor, res, ref, res, ref);
return EXIT_FAILURE;
}
count++;
} while (count);
#else
for (dividend = 1.0f; dividend <= 2.0f; dividend = uint32_as_float (float_as_uint32 (dividend) + 1)) {
printf ("\rdividend=%08x", float_as_uint32 (dividend));
for (divisor = 1.0f; divisor <= 2.0f; divisor = uint32_as_float (float_as_uint32 (divisor) + 1)) {
res = fp32_div (dividend, divisor);
ref = dividend / divisor;
if (float_as_uint32 (res) != float_as_uint32 (ref)) {
printf ("error @ dividend=% 16.6a divisor=% 16.6a (% 15.8e, % 15.8e): res=% 16.6a ref=% 16.6a (% 15.8e, % 15.8e)\n",
dividend, divisor, dividend, divisor, res, ref, res, ref);
return EXIT_FAILURE;
}
}
}
#endif
return EXIT_SUCCESS;
}
Corresponding CUDA code (tested) for an NVIDIA GPU looks as follows:
__noinline__ __device__ float fp32_div_slowpath (float dividend, float divisor)
{
return dividend / divisor;
}
/* Subnormal arguments and results are treated as zero */
__forceinline__ __device__ float rcp_approx_gpu (float divisor)
{
float r;
asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(divisor));
return r;
}
__forceinline__ __device__ float fp32_div (float dividend, float divisor)
{
const uint32_t FP32_MANT_MASK = 0x007fffffu;
const uint32_t FP32_ONE = 0x3f800000u;
const uint32_t FP32_SIGN_MASK = 0x80000000u;
const uint32_t FP32_SIGN_EXPO_MASK = 0xff800000u;
const uint32_t FP32_QUOTIENT_HI_LIMIT = 0x7f7fffffu; // 0x1.0p+128 - ulp
const uint32_t FP32_QUOTIENT_LO_LIMIT = 0x00800001u; // 0x1.0p-126 + ulp
// fast path
float recip_approx = rcp_approx_gpu (divisor);
float recip_err = fmaf (recip_approx, -divisor, 1.0f);
float refined_recip = fmaf (recip_approx, recip_err, recip_approx);
float dividend_mant = __int_as_float ((__float_as_int (dividend) & FP32_MANT_MASK) | FP32_ONE);
float dividend_scale = __int_as_float (__float_as_int (dividend) & FP32_SIGN_EXPO_MASK);
float refined_quotient_mant = refined_recip * dividend_mant;
float refined_quotient_residual = fmaf (refined_quotient_mant, -divisor, dividend_mant);
float final_quotient_mant = fmaf (refined_recip, refined_quotient_residual, refined_quotient_mant);
float final_quotient = final_quotient_mant * dividend_scale;
// check if we need to apply the slow path and invoke it if that is the case
uint32_t iq = __float_as_int (final_quotient) & ~FP32_SIGN_MASK;
if ((iq - FP32_QUOTIENT_LO_LIMIT) > (FP32_QUOTIENT_HI_LIMIT - FP32_QUOTIENT_LO_LIMIT)) {
final_quotient = fp32_div_slowpath (dividend, divisor);
}
return final_quotient;
}