Subnormal numbers in different precisions with MPFR

I would like to emulate various n-bit binary floating-point formats, each with a specified e_max and e_min, with p bits of precision. I would like these formats to emulate subnormal numbers, faithful to the IEEE-754 standard.

Naturally, my search has lead me to the MPFR library, being IEEE-754 compliant and able to support subnormals with the mpfr_subnormalize() function. However, I've ran into some confusion using mpfr_set_emin() and mpfr_set_emax() to correctly set up a subnormal-enabled environment. I will use IEEE double-precision as an example format, since this is the example used in the MPFR manual:

http://mpfr.loria.fr/mpfr-current/mpfr.html#index-mpfr_005fsubnormalize

mpfr_set_default_prec (53);
mpfr_set_emin (-1073); mpfr_set_emax (1024);

The above code is from the MPFR manual in the above link - note that neither e_max nor e_min are equal to the expected values for double. Here, p is set to 53, as expected of the double type, but e_max is set to 1024, rather than the correct value of 1023, and e_min is set to -1073; well below the correct value of -1022. I understand that setting the exponent bounds too tightly results in overflow/underflow in intermediate computations in MPFR, but I have found that setting e_min exactly is critical for ensuring correct subnormal numbers; too high or too low causes a subnormal MPFR result (updated with mprf_subnormalize()) to differ from the corresponding double result.

My question is how should one decide which values to pass to mpfr_set_emax() and (especially) mpfr_set_emin(), in order to guarantee correct subnormal behaviour for a floating-point format with exponent bounds e_max and e_min? There doesn't seem to be any detailed documentation or discussion on the matter.

With all my thanks,

James.

EDIT 30/07/16: Here is a small program which demonstrates the choice of e_max and e_min for single-precision numbers.

#include <iostream>
#include <cmath>
#include <float.h>
#include <mpfr.h>

using namespace std;

int main (int argc, char *argv[]) {
    cout.precision(120);

    // IEEE-754 float emin and emax values don't work at all
    //mpfr_set_emin (-126);
    //mpfr_set_emax (127);

    // Not quite
    //mpfr_set_emin (-147);
    //mpfr_set_emax (128);

    // Not quite
    //mpfr_set_emin (-149);
    //mpfr_set_emax (128);

    // These float emin and emax values work in subnormal range
    mpfr_set_emin (-148);
    mpfr_set_emax (128);

    cout << "emin: " << mpfr_get_emin() << "    emax: " << mpfr_get_emax() << endl;

    float f = FLT_MIN;
    for (int i = 0; i < 3; i++) f = nextafterf(f, INFINITY);

    mpfr_t m;
    mpfr_init2 (m, 24);
    mpfr_set_flt (m, f, MPFR_RNDN);

    for (int i = 0; i < 6; i++) {
        f = nextafterf(f, 0);
        mpfr_nextbelow(m);
        cout << i << ": float: " << f << endl;
        //cout << i << ":  mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
        mpfr_subnormalize (m, 1, MPFR_RNDN);
        cout << i << ":  mpfr: " << mpfr_get_flt (m, MPFR_RNDN) << endl;
    }

    mpfr_clear (m);
    return 0;
}

Solution

I'm copying my answer I gave on ResearchGate (with a link to the mpfr_subnormalize documentation):

There are different conventions to express significands and the associated exponents. IEEE 754 chooses to consider significands between 1 and 2, while MPFR (like the C language, see DBL_MAX_EXP for instance) chooses to consider significands between 1/2 and 1 (for practical reasons related to multiple precision). For instance, the number 17 is represented as 1.0001·2⁴ in IEEE 754 and as 0.10001·2⁵ in MPFR. As you can see, this means that exponents are increased by 1 in MPFR compared to IEEE 754, hence e_max = 1024 instead of 1023 for double precision.

Concerning the choice of e_min for double precision, one needs to be able to represent 2⁻¹⁰⁷⁴ = 0.1·2⁻¹⁰⁷³, so that e_min needs to be at most −1073 (as in MPFR, all numbers are normalized).

As documented, the mpfr_subnormalize function considers that the subnormal exponent range is from e_min to e_min + PREC(x) − 1, so that for instance, you need to set e_min = −1073 to emulate IEEE double precision.