Search code examples
cdoublebit-manipulationieee-754mantissa

bitwise splitting the mantissa of a IEEE 754 double? how to access bit structure,


(sorry, I'm coming up with some funny ideas... bear with me...)

Let's say I have a 'double' value, consisting of:

                 implicit
sign exponent    bit         mantissa
0    10000001001 (1).0011010010101010000001000001100010010011011101001100

representing 1234.6565 if I'm right.

I'd like to be able to access the fields of sign, exponent, implicit and mantissa separately as bits!, and manipulate them with bitwise operations like AND, OR, XOR ... or string operations like 'left', mid etc.

And then I'd like to puzzle a new double together from the manipulated bits.

e.g. setting the sign bit to 1 would make the number negative, adding or subtracting 1 to/from the exponent would double/halve the value, stripping all bits behind the position indicated by the recalculated (unbiased) value of the exponent would convert the value to an integer and so on.

Other tasks would/could be to find the last set bit, calculate how much it contributes to the value, check if the last bit is '1' (binary 'odd') or '0' (binary 'even') and the like.

I have seen similar in programs, just can't find it on the fly. I may remember something with 'reinterpret cast' or similar? I think there are libraries or toolkits or 'howtos' around which offer access to such, and hope here are reading people who can point me to such.

I'd like a solution near simple processor instructions and simple C code. I'm working in Debian Linux and compiling with gcc which was in by default.

startpoint is any double value i can address as 'x',

startpoint 2 is I'm not! an experienced programmer :-(

How to do easy, and get it working with good performance?


Solution

  • This is straightforward, if a bit esoteric.

    Step 1 is to access the individual bits of a float or double. There are a number of ways of doing this, but the commonest are to use a char * pointer, or a union. For our purposes today let's use a union. [There are subtleties to this choice, which I'll address in a footnote.]

    union doublebits {
        double d;
        uint64_t bits;
    };
    
    union doublebits x;
    x.d = 1234.6565;
    

    So now x.bits lets us access the bits and bytes of our double value as a 64-bit unsigned integer. First, we could print them out:

    printf("bits: %llx\n", x.bits);
    

    This prints

    bits: 40934aa04189374c
    

    and we're on our way.

    The rest is "simple" bit manipulation. We'll start by doing it the brute-force, obvious way:

    int sign = x.bits >> 63;
    int exponent = (x.bits >> 52) & 0x7ff;
    long long mantissa = x.bits & 0xfffffffffffff;
    
    printf("sign = %d, exponent = %d, mantissa = %llx\n", sign, exponent, mantissa);
    

    This prints

    sign = 0, exponent = 1033, mantissa = 34aa04189374c
    

    and these values exactly match the bit decomposition you showed in your question, so it looks like you were right about the number 1234.6565.

    What we have so far are the raw exponent and mantissa values. As you know, the exponent is offset, and the mantissa has an implicit leading "1", so let's take care of those:

    exponent -= 1023;
    mantissa |= 1ULL << 52;
    

    (Actually this isn't quite right. Soon enough we're going to have to address some additional complications having to do with denormalized numbers, and infinities and NaNs.)

    Now that we have the true mantissa and exponent, we can do some math to recombine them, to see if everything is working:

    double check = (double)mantissa * pow(2, exponent);
    

    But if you try that, it gives the wrong answer, and it's because of a subtlety that, for me, is always the hardest part of this stuff: Where is the decimal point in the mantissa, really? (Actually, it's not a "decimal point", anyway, because we're not working in decimal. Formally it's a "radix point", but that sounds too stuffy, so I'm going to keep using "decimal point", even though it's wrong. Apologies to any pedants whom this rubs the wrong way.)

    When we did mantissa * pow(2, exponent) we assumed a decimal point, in effect, at the right end of the mantissa, but really, it's supposed to be 52 bits to the left of that (where that number 52 is, of course, the number of explicit mantissa bits). That is, our hexadecimal mantissa 0x134aa04189374c (with the leading 1 bit restored) is actually supposed to be treated more like 0x1.34aa04189374c. We can fix this by adjusting the exponent, subtracting 52:

    double check = (double)mantissa * pow(2, exponent - 52);
    printf("check = %f\n", check);
    

    So now check is 1234.6565 (plus or minus some roundoff error). And that's the same number we started with, so it looks like our extraction was correct in all respects.

    But we have some unfinished business, because for a fully general solution, we have to handle "subnormal" (also known as "denormalized") numbers, and the special representations inf and NaN.

    These wrinkles are controlled by the exponent field. If the exponent (before subtracting the bias) is exactly 0, this indicates a subnormal number, that is, one whose mantissa is not in the normal range of (decimal) 1.00000 to 1.99999. A subnormal number does not have the implicit leading "1" bit, and the mantissa ends up being in the range from 0.00000 to 0.99999. (This also ends up being the way the ordinary number 0.0 has to be represented, since it obviously can't have that implicit leading "1" bit!)

    On the other hand, if the exponent field has its maximum value (that is, 2047, or 211-1, for a double) this indicates a special marker. In that case, if the mantissa is 0, we have an infinity, with the sign bit distinguishing between positive and negative infinity. Or, if the exponent is max and the mantissa is not 0, we have a "not a number" marker, or NaN. The specific nonzero value in the mantissa can be used to distinguish between different kinds of NaN, like "quiet" and "signaling" ones, although it turns out the particular values that might be used for this aren't standard, so we'll ignore that little detail.

    (If you're not familiar with infinities and NaNs, they're what IEEE-754 says that certain operations are supposed to return when the proper mathematical result is, well, not an ordinary number. For example, sqrt(-1.0) returns NaN, and 1./0. typically gives inf. There's a whole set of IEEE-754 rules about infinities and NaNs, such as that atan(inf) returns π/2.)

    The bottom line is that instead of just blindly tacking on the implicit 1 bit, we have to check the exponent value first, and do things slightly differently depending on whether the exponent has its maximum value (indicating specials), an in-between value (indicating ordinary numbers), or 0 (indicating subnormal numbers):

    if(exponent == 2047) {
        /* inf or NAN */
        if(mantissa != 0)
             printf("NaN\n");
        else if(sign)
             printf("-inf\n");
        else printf("inf\n");
    } else if(exponent != 0) {
        /* ordinary value */
        mantissa |= 1ULL << 52;
    } else {
        /* subnormal */
        exponent++;
    }
    
    exponent -= 1023;
    

    That last adjustment, adding 1 to the exponent for subnormal numbers, reflects the fact that subnormals are "interpreted with the value of the smallest allowed exponent, which is one greater" (per the Wikipedia article on subnormal numbers).

    I said this was all "straightforward, if a bit esoteric", but as you can see, while extracting the raw mantissa and exponent values is indeed pretty straightforward, interpreting what they actually mean can be a challenge!


    If you already have raw exponent and mantissa numbers, going back in the other direction — that is, constructing a double value from them — is just about as straightforward:

    sign = 1;
    exponent = 1024;
    mantissa = 0x921fb54442d18;
    
    x.bits = ((uint64_t)sign << 63) | ((uint64_t)exponent << 52) | mantissa;
    
    printf("%.15f\n", x.d);
    

    This answer is getting too long, so for now I'm not going to delve into the question of how to construct appropriate exponent and mantissa numbers from scratch for an arbitrary real number. (Me, I usually do the equivalent of x.d = atof(the number I care about), and then use the techniques we've been discussing so far.)


    Your original question was about "bitwise splitting", which is what we've been discussing. But it's worth noting that there's a much more portable way to do all this, if you don't want to muck around with raw bits, and if you don't want/need to assume that your machine uses IEEE-754. If you just want to split a floating-point number into a mantissa and an exponent, you can use the standard library frexp function:

    int exp;
    double mant = frexp(1234.6565, &exp);
    printf("mant = %.15f, exp = %d\n", mant, exp);
    

    This prints

    mant = 0.602859619140625, exp = 11
    

    and that looks right, because 0.602859619140625 × 211 = 1234.6565 (approximately). (How does it compare to our bitwise decomposition? Well, our mantissa was 0x34aa04189374c, or 0x1.34aa04189374c, which in decimal is 1.20571923828125, which is twice the mantissa that ldexp just gave us. But our exponent was 1033 - 1023 = 10, which is one less, so it comes out in the wash: 1.20571923828125 × 210 = 0.602859619140625 × 211 = 1234.6565.)

    There's also a function ldexp that goes in the other direction:

    double x2 = ldexp(mant, exp);
    printf("%f\n", x2);
    

    This prints 1234.656500 again.


    Footnote: When you're trying to access the raw bits of something, as of course we've been doing here, there are some lurking portability and correctness questions having to do with something called strict aliasing. Strictly speaking, and depending on who you ask, you may need to use an array of unsigned char as the other part of your union, not uint64_t as I've been doing here. And there are those who say that you can't portably use a union at all, that you have to use memcpy to copy the bytes into a completely separate data structure, although I think they're taking about C++, not C.