I need to find out the exact binary representation for `float`s and `double`s in a C++ project built with PlatformIO for an ATmega328 using the Arduino framework. I don't have access to the actual hardware, so I can't check it myself.
The micro does not have an FPU and is 8-bit, so it's pretty much all up to the compiler (or the framework's libraries?), which in this case seems to be `avr-gcc`, version 7.3. I've managed to get as far as the `avr-gcc` documentation, which tells me that by default `double` is represented the same way as `float`, but it does not specify what that representation actually is (the IEEE standard is only mentioned for an optional `long double`).
So the question is twofold. Most importantly, I need to know which representation `float` uses in this particular case (I strongly suspect it's IEEE 754, but could use confirmation). Secondly, I wonder where I can find this information formally, as part of some kind of official documentation.
In any case, the floating-point format is:
IEEE-754, binary, little-endian.
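To make "little-endian" concrete, here is a minimal sketch that copies a `float` into a byte array in memory order. The helper name `float_to_bytes` is made up for illustration. It assumes a 4-byte `float`, and it uses the C headers because the AVR toolchain typically ships without the C++ standard library headers.

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical helper: copy the object representation of a float into
// out[0..3], in the order the bytes sit in memory.
static void float_to_bytes(float value, uint8_t out[4]) {
    static_assert(sizeof(float) == 4, "this sketch expects a 32-bit float");
    memcpy(out, &value, sizeof value);  // memcpy avoids aliasing problems
}
```

With a little-endian layout, `1.0f` (bit pattern `0x3F800000`) should come out as the bytes `00 00 80 3F`, least significant byte first. Since you don't have the hardware, note that a little-endian host such as x86 gives the same byte order for `float`.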
In the encoded form, respective parts of the representation will occupy:
| | 32-Bit Floating-Point | 64-Bit Floating-Point |
|---|---|---|
| Sign | 1 bit (31) | 1 bit (63) |
| Biased Exponent | 8 bits (30−23) | 11 bits (62−52) |
| Encoded Mantissa | 23 bits (22−0) | 52 bits (51−0) |
| Exponent Bias | 127 | 1023 |
| `sizeof` (bytes) | 4 | 8 |
NaNs are non-signalling.
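The 32-bit column of the table can be sketched as a small decoder. The names `FloatFields`, `decode_float` and `is_nan_encoding` are hypothetical and only serve this example; the shifts and masks follow the bit positions listed above.

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical helper type holding the three encoded parts of a 32-bit float.
struct FloatFields {
    uint8_t  sign;             // bit 31
    uint8_t  biased_exponent;  // bits 30..23, bias 127
    uint32_t mantissa;         // bits 22..0, without the implicit leading 1
};

// Split a float into its encoded fields.
static FloatFields decode_float(float value) {
    static_assert(sizeof(float) == sizeof(uint32_t),
                  "this sketch expects a 32-bit float");
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    FloatFields f;
    f.sign            = (uint8_t)(bits >> 31);
    f.biased_exponent = (uint8_t)((bits >> 23) & 0xFFu);
    f.mantissa        = bits & 0x7FFFFFUL;
    return f;
}

// A NaN has an all-ones exponent and a non-zero mantissa;
// an all-ones exponent with a zero mantissa is an infinity.
static bool is_nan_encoding(const FloatFields &f) {
    return f.biased_exponent == 0xFF && f.mantissa != 0;
}
```

For example, `-6.5f` is `-1.625 * 2^2`, so it decodes to sign 1, biased exponent 129 (2 + 127) and mantissa `0x500000`.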
Some of the properties are available as GCC built-in macros. For example, for `float`, run
> echo "" | avr-gcc -xc - -E -dM | grep _FL | sort
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
...
#define __FLT_HAS_DENORM__ 1
#define __FLT_HAS_INFINITY__ 1
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MANT_DIG__ 24
#define __FLT_MAX_EXP__ 128
...
#define __FLT_MIN_EXP__ (-125)
#define __FLT_RADIX__ 2
#define __SIZEOF_FLOAT__ 4
For `double` properties, grep for `__DBL` or `DOUBLE`.
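If you'd rather have the build fail than rely on reading the macro dump, the same macros can be checked at compile time. This is only a sketch: the function name `float_looks_like_binary32` is made up, and the checks cover just the parameters shown above (radix, significand width, exponent range, size), which is consistent with IEEE-754 binary32 but is not a proof of full conformance.

```cpp
// Hypothetical compile-time check built from GCC's predefined macros.
// Written as a single return statement so it is valid C++11 constexpr.
constexpr bool float_looks_like_binary32() {
    return __FLT_RADIX__ == 2        // binary format
        && __FLT_MANT_DIG__ == 24    // 23 stored bits + 1 implicit bit
        && __FLT_MAX_EXP__ == 128
        && __FLT_MIN_EXP__ == -125
        && __SIZEOF_FLOAT__ == 4;    // 4 bytes
}

// Breaks the build if the toolchain's float does not match.
static_assert(float_looks_like_binary32(),
              "float does not have binary32 parameters");
```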
Up to and including avr-gcc v9, we have `float` = `double` = `long double`, and all are 32 bits wide.
For avr-gcc v10 onwards, the size of `double` depends on the command-line option `-mdouble=[32|64]`, cf. avr-gcc command line options. The default and the availability of this option depend on the configure option `--with-double=...`, cf. the GCC configure options for the AVR backend.
The same applies to `long double` with `-mlong-double=` and `--with-long-double=`, respectively.
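Because the width of `double` can differ between toolchain versions and options, code that cares can query it at compile time. A minimal sketch, assuming GCC's predefined `__SIZEOF_DOUBLE__` and `__DBL_MANT_DIG__` macros; the function names are hypothetical.

```cpp
// Width of double in bits for the current compilation.
constexpr unsigned double_bits() {
    return __SIZEOF_DOUBLE__ * 8u;
}

// True when double has the size and significand width of IEEE-754 binary64.
constexpr bool double_is_binary64_like() {
    return __SIZEOF_DOUBLE__ == 8 && __DBL_MANT_DIG__ == 53;
}

// True when double is laid out like float, as on avr-gcc up to v9.
constexpr bool double_aliases_float() {
    return __SIZEOF_DOUBLE__ == __SIZEOF_FLOAT__
        && __DBL_MANT_DIG__ == __FLT_MANT_DIG__;
}
```

With avr-gcc 7.3 I would expect `double_aliases_float()` to be true and `double_bits()` to be 32. On a typical desktop host it is the other way round, which is worth keeping in mind if you test on a PC instead of the hardware.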
Floating-point libraries do not support reduced tiny cores (`-mmcu=avrtiny`).
64-bit floating-point support is incomplete for devices that don't support the `MUL` instruction.
For computations on the host like constant folding, GCC uses MPFR.
32-bit floating point for the AVR target is implemented as part of AVR-LibC, even parts you'd usually expect in libgcc.
64-bit floating point for the AVR target is implemented as part of libgcc, even parts you'd usually expect in libm.
Some functions might not be 100% IEEE compliant. For example, IEEE requires that the result of functions like `sin` is as if `sin` was computed with infinite precision and then rounded according to the selected rounding mode. Due to efficiency considerations, some functions might return results with less precision than mandated by IEEE.
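To quantify how far a library result is from a reference value, one common approach is to count the representable floats between the two (the distance in ULPs). The sketch below is not part of AVR-LibC or libgcc; `ordered_bits` and `ulp_distance` are hypothetical helpers. It assumes a 32-bit IEEE-754 `float` and a reference value you obtained elsewhere, for example computed with higher precision on a host.

```cpp
#include <stdint.h>
#include <string.h>

// Map a float's bit pattern onto an unsigned scale that increases
// monotonically with the float's value. NaNs are not meaningful here.
static uint32_t ordered_bits(float value) {
    const uint32_t sign_bit = 0x80000000UL;
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    // Negative values: invert all bits. Non-negative values: set the sign bit.
    return (bits & sign_bit) ? (uint32_t)~bits : (uint32_t)(bits | sign_bit);
}

// Number of representable steps between a and b.
// Note: -0.0f and +0.0f count as one step apart in this scheme.
static uint32_t ulp_distance(float a, float b) {
    const uint32_t oa = ordered_bits(a);
    const uint32_t ob = ordered_bits(b);
    return oa > ob ? oa - ob : ob - oa;
}
```

A correctly rounded result is at most one step away from any other candidate that is also within half an ULP of the exact value, so a distance of 2 or more between the library's `sin` result and a trusted reference indicates reduced precision.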