How does 'long double' occupy 16 bytes / 128 bits of storage when my CPU is 64-bit (X64 ISA)?

I'm a novice C programmer; this might be a naive question to ask, but please bear with me. I learnt that the storage sizes of char, short, int, long, long long, float, double, and long double are implementation-dependent and can vary greatly. When I used sizeof(long) on my machine, it gave me 8 bytes, and when I applied sizeof to long long it gave 8 bytes as well. Therefore I thought, since my CPU is based on the X64 ISA, it might only support 64 bits / 8 bytes at most. But when I applied sizeof to long double, the result was 16 bytes.

How come long double can be 16 bytes wide but not long long? If long long is the same as long, what's the point of having it? And one more question: how does my CPU actually handle a 128-bit-wide long double when its registers are only 64 bits wide?

Additional Information: I use gcc (GCC) 12.2.0, if it has something to do with the compiler.

#include <stdio.h>

int main(void) {
    /* sizeof yields a size_t, so the matching conversion is %zu, not %d */
    printf("%zu\n", sizeof(long));
    printf("%zu\n", sizeof(long long));
    printf("%zu\n", sizeof(long double));

    return 0;
}

The result was as follows:

8
8
16

Solution

In the x86-64 System V ABI (used on Linux, BSD, macOS, etc.), long double is the 80-bit x87 format (same as in i386 System V). The actual data takes 10 bytes, the rest is padding for alignment.

(In standard Windows x64, long double uses the same format as double, IEEE binary64, so there's no access to the legacy hardware support for extended-precision FP, only what can be done with SSE / SSE2. GCC allows -mlong-double-80, and that's actually GCC's default when targeting Windows, so by default it is not ABI-compatible with MSVC for long double! See also "Did any compiler fully use Intel x87 80-bit floating point?" on retrocomputing.SE.)

The x86-64 ABI designers elected to increase alignof(long double) to 16 so it could be copied around more efficiently (e.g. with SSE movaps), and perhaps also so that the actual x87 fld / fstp instructions would be more efficient by avoiding cache-line splits¹.

i386 System V used alignof(long double) = 4, so sizeof(long double) == 12 with gcc -m32 on x86-64 / i386 GNU/Linux: just 2 bytes of padding in that case. sizeof(T) must be a multiple of alignof(T), and must be large enough to hold all the value bits.

Actual 386 CPUs didn't care about alignments greater than 4 so this wasn't a problem for them when the ABI was designed. But P6 and K7 CPUs did care (which were current when x86-64 SysV was being designed), and had worse misalignment penalties than modern CPUs. (See Why does Windows64 use a different calling convention from all other OSes on x86-64? for some links to archives of the x86-64.org mailing list where this was discussed, along with other design decisions.)

FP long double is completely unrelated to integer long long. C requires long long to be at least 64 bits wide (strictly it's a value-range requirement, but integer types are also required to be binary). 64-bit integers are a natural fit for x86-64, being the width of an integer register. Making long long even wider would have slowed down any code that used it (requiring add/adc for +, and a widening mul plus two non-widening imul for multiplication).

long is also 64 bits in x86-64 System V, but C only requires it to be at least 32 bits (which is what Windows x64 chose, so this is a real-world portability problem if you want a type wide enough to hold pointers for example).

A lot of older code uses long long for variables that need to be at least 64-bit but which don't benefit from being 128-bit or wider. (Some newer code uses int_least64_t or int_fast64_t for that, although many of the narrower fast types are not good for most purposes due to bad choices on mainstream platforms like x86-64 GNU/Linux, which makes them all 64-bit except for int_fast8_t, wasting huge amounts of space and cache footprint in arrays or structs. See "int_fast8_t size vs int_fast16_t size on x86-64 platform".)

So there's an expectation that types like long and long long are chosen to be efficient sizes for the target ISA, not extended-precision (bigger than a register) unless necessary to meet the minimum value-range requirements in the C standard.

This system of int / long / long long is not the most usable for writing portable efficient code. Once the computing landscape settled down to machines with 8-bit bytes and typically 32 or 64-bit registers, C99 introduced types like int32_t and int64_t (optional, but if present they are also required to be 2's complement, not 1's complement or sign/magnitude). And C23 now adds _BitInt(128). So if you know 32 or 64-bit is enough, you can ask for that size explicitly.

Footnote 1: IDK if cache line splits for 80-bit x87 are a real problem on modern CPUs. I think the 2 load-port uops of fld on modern Intel are probably a 64-bit load of the mantissa and a 16-bit load of the exponent+sign, with the other uops being needed to put them together.

So alignof(long double) == 8 might have been sufficient today, but that would still make the size 16. Cache-line splits could still have two cache misses, making random access to a big cold array worse than the actual design.

But wasting 6 of every 16 bytes on padding makes cache footprint worse, and wastes memory bandwidth for sequential access. So no single ABI choice is optimal for all use-cases.