Tags: c++, floating-point, avr-gcc

Where to find information about the exact binary representation of floating point values used by avr-gcc when compiling for 8-bit processors?


I need to find out the exact binary representation for floats and doubles in a C++ project built with Platformio for an Atmega328 using the Arduino framework. I don't have access to the actual hardware so I can't check it myself.

The micro does not have an FPU and is 8-bit, so it's pretty much all up to the compiler (or the framework's libraries?), which in this case seems to be avr-gcc, version 7.3. I've managed to get as far as the avr-gcc documentation, which tells me that by default double is represented the same way as float, but it does not specify what that representation actually is (the IEEE standard is only mentioned for an optional long double).

So, the question is kinda twofold, really. Most importantly, I need to know what representation float uses in this particular case (I strongly suspect it's IEEE 754, but could use a confirmation). And secondly, I wonder where I can find this information formally, as part of some kind of official documentation.


Solution

    Floating-Point Format

    In any case, the floating-point format is:

    IEEE-754, binary, little-endian.

    In the encoded form, respective parts of the representation will occupy:

                      32-Bit Floating-Point   64-Bit Floating-Point
    Sign              1 bit (31)              1 bit (63)
    Biased Exponent   8 bits (30-23)          11 bits (62-52)
    Encoded Mantissa  23 bits (22-0)          52 bits (51-0)
    Exponent Bias     127                     1023
    sizeof            4                       8
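
To make the 32-bit layout above concrete, here is a minimal sketch that splits a float into its three fields. The names Float32Fields and decode_float32 are hypothetical helpers made up for this illustration, not part of avr-gcc or AVR-LibC. The sketch assumes a 4-byte binary32 float and only uses C headers, so it should build both on a host compiler and with avr-gcc:

```cpp
#include <stdint.h>
#include <string.h>

// Fields of a binary32 value, following the bit positions listed above.
// (Hypothetical struct, introduced only for this illustration.)
struct Float32Fields {
    uint8_t  sign;      // bit 31
    uint8_t  exponent;  // bits 30-23, biased by 127
    uint32_t mantissa;  // bits 22-0, without the implicit leading 1
};

// Hypothetical helper: copy the object representation into an integer
// and mask out the individual fields.
inline Float32Fields decode_float32(float value)
{
    static_assert(sizeof(float) == 4, "this sketch assumes a 32-bit float");

    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  // avoids pointer-cast aliasing issues

    Float32Fields f;
    f.sign     = (uint8_t)(bits >> 31);
    f.exponent = (uint8_t)((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x007FFFFFul;
    return f;
}
```

For example, 1.0f decodes to sign 0, biased exponent 127 and mantissa 0, since 1.0 = 1.0 x 2^(127 - 127).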

    NaNs are non-signalling.

    Some of these properties are available as GCC built-in macros. For example, for float, run:

    > echo "" | avr-gcc -xc - -E -dM | grep _FL | sort
    
    #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
    ...
    #define __FLT_HAS_DENORM__ 1
    #define __FLT_HAS_INFINITY__ 1
    #define __FLT_HAS_QUIET_NAN__ 1
    #define __FLT_MANT_DIG__ 24
    #define __FLT_MAX_EXP__ 128
    ...
    #define __FLT_MIN_EXP__ (-125)
    #define __FLT_RADIX__ 2
    #define __SIZEOF_FLOAT__ 4
    

    For double properties, grep for __DBL or DOUBLE.
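
Since these are ordinary predefined macros, a project can also check them at compile time instead of inspecting the preprocessor dump by hand. The following is a sketch under the assumption that the compiler is GCC-compatible and defines the macros listed above. The constant name kFloatLooksLikeBinary32 is made up for this example:

```cpp
// Compile-time check that float has the binary32 parameters shown above.
// kFloatLooksLikeBinary32 is a hypothetical name used only in this sketch.
constexpr bool kFloatLooksLikeBinary32 =
    __FLT_RADIX__    == 2    &&   // binary format
    __FLT_MANT_DIG__ == 24   &&   // 23 stored bits + 1 implicit bit
    __FLT_MAX_EXP__  == 128  &&
    __FLT_MIN_EXP__  == -125 &&
    __SIZEOF_FLOAT__ == 4;

static_assert(kFloatLooksLikeBinary32,
              "float does not have the expected binary32 parameters");

// Not every compiler defines the word-order macro, so guard its use.
#if defined(__FLOAT_WORD_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
static_assert(__FLOAT_WORD_ORDER__ == __ORDER_LITTLE_ENDIAN__,
              "expected little-endian floating-point word order");
#endif
```

If any of the assumptions does not hold for the selected target, the build fails with the given message rather than silently misinterpreting the bytes.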

    Floating-Point Availability

    • Up to and including avr-gcc v9, we have float = double = long double and all are 32 bits wide.

    • For avr-gcc v10 onwards: The size of double depends on command line option -mdouble=[32|64], cf. avr-gcc command line options. The default and availability of this option depends on configure option --with-double=..., cf. the GCC configure options for the AVR backend.

      The same applies to long double with -mlong-double= and --with-long-double=, respectively.

    • Floating-point libraries do not support reduced tiny cores (-mmcu=avrtiny).

    • 64-bit floating-point support is incomplete for devices that don't support the MUL instruction.
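
Because the width of double and long double depends on the compiler version and options, code that needs to know which one it got can query GCC's predefined size macros. This is only a sketch: wide_float_t, kDoubleBits, kLongDoubleBits and kHave64BitFloat are hypothetical names chosen for illustration.

```cpp
// Width of the floating-point types as configured for this build.
constexpr unsigned kDoubleBits     = __SIZEOF_DOUBLE__ * 8;
constexpr unsigned kLongDoubleBits = __SIZEOF_LONG_DOUBLE__ * 8;

// Pick a 64-bit floating-point type if the build provides one.
#if __SIZEOF_DOUBLE__ == 8
typedef double wide_float_t;          // e.g. avr-gcc v10+ with -mdouble=64
constexpr bool kHave64BitFloat = true;
#elif __SIZEOF_LONG_DOUBLE__ == 8
typedef long double wide_float_t;     // e.g. avr-gcc v10+ with -mlong-double=64
constexpr bool kHave64BitFloat = true;
#else
typedef float wide_float_t;           // fallback: only 32-bit floating point
constexpr bool kHave64BitFloat = false;
#endif
```

With avr-gcc up to v9 this would be expected to select the 32-bit fallback, in line with the first bullet above.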

    Floating-Point Implementation

    • For computations on the host like constant folding, GCC uses MPFR.

    • 32-bit floating point for the AVR target is implemented as part of AVR-LibC, even parts you'd usually expect in libgcc.

    • 64-bit floating point for the AVR target is implemented as part of libgcc, even parts you'd usually expect in libm.

    • Some functions might not be 100% IEEE compliant. For example, IEEE 754 specifies that the result of functions like sin should be as if sin had been computed with infinite precision and then rounded according to the selected rounding mode (correct rounding). For efficiency reasons, some functions might return results with less precision than that.
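
One practical consequence of the last point is that results computed on the AVR may differ from reference values in the last bit or two, so an exact equality test can be too strict. A possible approach, sketched here under the assumption of a 32-bit binary32 float and finite, non-NaN inputs, is to measure the difference in units in the last place (ULP). The functions ordered_bits and ulp_distance are hypothetical helpers, not library functions:

```cpp
#include <stdint.h>
#include <string.h>

// Map a float's sign-magnitude bit pattern to an integer whose ordering
// matches the ordering of the float values (+0.0 and -0.0 both map to 0).
inline int32_t ordered_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    int32_t magnitude = (int32_t)(bits & 0x7FFFFFFFul);
    return (bits & 0x80000000ul) ? -magnitude : magnitude;
}

// Number of representable floats between a and b (0 if they are equal).
// NaN inputs are not handled in this sketch.
inline uint32_t ulp_distance(float a, float b)
{
    int64_t diff = (int64_t)ordered_bits(a) - (int64_t)ordered_bits(b);
    if (diff < 0) {
        diff = -diff;
    }
    return (uint32_t)diff;
}
```

A test could then accept a result if, say, ulp_distance(result, expected) is at most some small bound. Which bound is appropriate depends on the function and is not specified here.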