Where to find information about the exact binary representation of floating point values used by avr-gcc when compiling for 8-bit processors?

I need to find out the exact binary representation of floats and doubles in a C++ project built with PlatformIO for an ATmega328 using the Arduino framework. I don't have access to the actual hardware, so I can't check it myself.

The micro is 8-bit and does not have an FPU, so it's pretty much all up to the compiler (or the framework's libraries?), which in this case seems to be avr-gcc, version 7.3. I've gotten as far as the avr-gcc documentation, which tells me that by default double is represented the same way as float but does not specify what that representation actually is (the IEEE standard is only mentioned for an optional long double).

So, the question is kinda twofold, really. Most importantly, I need to know which representation is used for float in this particular case (I strongly suspect it's IEEE 754, but could use a confirmation). And secondly, I wonder where I can find this information formally, as part of some kind of official documentation.

Solution

Floating-Point Format

In any case, the floating-point format is:

IEEE-754, binary, little-endian. See also avr-gcc Wiki: Type Layout.

In the encoded form, the parts of the representation occupy the following bits:

Field	32-Bit Floating-Point	64-Bit Floating-Point
Sign	1 bit (31)	1 bit (63)
Biased Exponent	8 bits (30−23)	11 bits (62−52)
Encoded Mantissa	23 bits (22−0)	52 bits (51−0)
Exponent Bias	127	1023
sizeof	4	8
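
To make the 32-bit layout concrete, here is a minimal sketch that splits a float into the three fields from the table. The names Float32Fields and decode_float32 are made up for this illustration, and the code assumes that float is 32 bits wide and encoded as described above (true for avr-gcc and for typical host compilers).

```cpp
#include <stdint.h>
#include <string.h>

// Encoded fields of a 32-bit float, following the table above.
// (Hypothetical type, introduced only for this example.)
struct Float32Fields {
    uint8_t  sign;      // bit 31
    uint8_t  exponent;  // bits 30..23, biased by 127
    uint32_t mantissa;  // bits 22..0, without the implicit leading 1
};

// Hypothetical helper: reinterpret the bytes of a float and extract the fields.
Float32Fields decode_float32(float value) {
    static_assert(sizeof(float) == 4, "this sketch expects a 32-bit float");

    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);  // copy the raw bytes, no type punning

    Float32Fields f;
    f.sign     = static_cast<uint8_t>(bits >> 31);
    f.exponent = static_cast<uint8_t>((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, decode_float32(1.0f) yields sign 0, biased exponent 127 and mantissa 0, because 1.0 is stored as 0x3F800000. Since this only relies on the encoding, it can be run on a host PC to inspect values without access to the AVR hardware.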

NaNs are non-signalling.

Some of the properties are available as GCC built-in macros, for example for float, run

> echo "" | avr-gcc -xc - -E -dM | grep _FL | sort

#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
...
#define __FLT_HAS_DENORM__ 1
#define __FLT_HAS_INFINITY__ 1
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MANT_DIG__ 24
#define __FLT_MAX_EXP__ 128
...
#define __FLT_MIN_EXP__ (-125)
#define __FLT_RADIX__ 2
#define __SIZEOF_FLOAT__ 4

For double properties, grep for __DBL or DOUBLE.
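
The same macros can be checked from within the code, so that a build fails if the assumptions about float do not hold. The following is only a sketch: the function name float_matches_binary32 is made up, and matching these parameters shows that the format has the binary32 shape, not that every library function is fully IEEE compliant.

```cpp
// Hypothetical helper: true if the predefined GCC macros describe a float
// with the parameters of IEEE-754 binary32.
constexpr bool float_matches_binary32() {
    return __FLT_RADIX__ == 2        // binary format
        && __FLT_MANT_DIG__ == 24    // 23 encoded bits plus the implicit leading 1
        && __FLT_MAX_EXP__ == 128
        && __FLT_MIN_EXP__ == -125
        && __SIZEOF_FLOAT__ == 4;
}

// Stop the build early if the compiler uses a different float layout.
static_assert(float_matches_binary32(),
              "float does not have the expected binary32 parameters");
```

This needs at least C++11 for constexpr and static_assert, and a compiler that predefines these macros, such as GCC.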

Floating-Point Availability

Up to and including avr-gcc v9, we have float = double = long double and all are 32 bits wide.
For avr-gcc v10 onwards: The size of double depends on the command line option -mdouble=[32|64], cf. avr-gcc command line options. The default and the availability of this option depend on the configure option --with-double=..., cf. the GCC configure options for the AVR backend.

The same applies to long double with -mlong-double= and --with-long-double=, respectively.
Floating-point libraries do not support reduced tiny cores (-mmcu=avrtiny).
64-bit floating-point support is incomplete for devices that don't support the MUL instruction.
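
Because the width of double depends on the compiler version and options, code that cares about it can query the width at compile time instead of assuming it. A minimal sketch, using GCC's predefined __SIZEOF_DOUBLE__ and __SIZEOF_FLOAT__ macros; the function names are made up for this example.

```cpp
// Hypothetical helper: width of double in bits for the current compilation.
// Assumes 8-bit bytes, which holds for AVR and common hosts.
constexpr unsigned double_bits() {
    return __SIZEOF_DOUBLE__ * 8u;
}

// Hypothetical helper: true if double is no wider than float,
// as with avr-gcc up to v9 or with -mdouble=32.
constexpr bool double_is_float_sized() {
    return __SIZEOF_DOUBLE__ == __SIZEOF_FLOAT__;
}
```

With avr-gcc 7.3 one would expect double_bits() to be 32, given that all floating-point types are 32 bits wide up to v9. On a typical host PC it is 64.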

Floating-Point Implementation

For computations on the host like constant folding, GCC uses MPFR.
32-bit floating point for the AVR target is implemented as part of AVR-LibC, even parts you'd usually expect in libgcc.
64-bit floating point for the AVR target is implemented as part of libgcc, even parts you'd usually expect in libm.
Some functions might not be 100% IEEE compliant. For example, IEEE requires that the result of functions like sin is as if sin was computed with infinite precision and then rounded according to the selected rounding mode. Due to efficiency considerations, some functions might return results with less precision than mandated by IEEE.