I noticed a weird thing today. When copying a `long double`[1], all of gcc, clang and icc generate `fld` and `fstp` instructions, with `TBYTE` memory operands. That is, the following function:
```cpp
void copy_prim(long double *dst, long double *src) {
    *src = *dst;
}
```
Generates the following assembly:
```asm
copy_prim(long double*, long double*):
        fld     TBYTE PTR [rdi]
        fstp    TBYTE PTR [rsi]
        ret
```
Now according to Agner's tables this is a poor choice for performance: `fld` takes four uops (none fused) and `fstp` takes a whopping seven uops (none fused), versus, say, a single fused uop each for `movaps` to/from an `xmm` register.
Interestingly, clang starts using `movaps` as soon as you put the `long double` in a `struct`. The following code:
```cpp
struct long_double {
    long double x;
};

void copy_ld(long_double *dst, long_double *src) {
    *src = *dst;
}
```
Compiles to the same `fld`/`fstp` assembly as previously shown for gcc and icc, but clang now uses:
```asm
copy_ld(long double*, long double*):
        movaps  xmm0, xmmword ptr [rdi]
        movaps  xmmword ptr [rsi], xmm0
        ret
```
Oddly, if you stuff an additional `int` member into the `struct` (which doubles its size to 32 bytes due to alignment), all compilers generate SSE-only copy code:
```asm
copy_ldi(long_double_int*, long_double_int*):
        movdqa  xmm0, XMMWORD PTR [rdi]
        movaps  XMMWORD PTR [rsi], xmm0
        movdqa  xmm0, XMMWORD PTR [rdi+16]
        movaps  XMMWORD PTR [rsi+16], xmm0
        ret
```
Is there any functional reason to copy floating point values with `fld` and `fstp`, or is it just a missed optimization?
[1] Although a `long double` (i.e., an x86 extended-precision float) is nominally 10 bytes, on x86-64 it has `sizeof == 16` and `alignof == 16`, since alignments have to be a power of two and an object's size must be a multiple of its alignment (so that array elements stay aligned).
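These values can be verified at compile time with `static_assert` (a minimal sketch; the asserted values hold on x86-64 System V targets such as Linux, but not on 32-bit x86, where `sizeof(long double)` is typically 12 with 4-byte alignment):

```cpp
#include <cstddef>

// Compile-time checks of the x86-64 System V layout of long double:
// 10 significant bytes padded out to 16, with 16-byte alignment.
static_assert(sizeof(long double) == 16,
              "long double padded to 16 bytes on x86-64");
static_assert(alignof(long double) == 16,
              "long double is 16-byte aligned on x86-64");
```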
It looks like a big missed-optimization for code that needs to copy `long double` without processing it. `fstp m80`/`fld m80` have a round-trip latency of 8 cycles on Skylake, vs. 5 for `movdqa` store-forwarding from store to reload. More importantly, Agner lists `fstp m80` as one per 5 clocks throughput, so there's something non-pipelined going on!
The only possible benefit I can think of is store-forwarding from a still-in-flight `long double` store. Consider a data-dependency chain that involves some x87 math, a `long double` store, then your function, then a `long double` load and more x87 math. According to Agner's tables, `fld`/`fstp` will add 8 cycles, but `movdqa` will see a store-forwarding stall and add 5 + 11 cycles or so for slow-path store-forwarding.
Probably the lowest-latency strategy to copy an `m80` would be 64-bit + 16-bit integer `mov`/`movzx` load/store instructions. We know that `fstp m80` and `fld m80` use 2 separate store-data (port 4) or load (p23) uops, and I think we can assume it's broken up as a 64-bit mantissa and a 16-bit sign:exponent.
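A sketch of that strategy in C++ (hypothetical helper, not something any compiler emits today; assumes x86's little-endian layout with the mantissa in the low 8 bytes, and relies on the compiler lowering small fixed-size `memcpy` calls to plain integer loads/stores):

```cpp
#include <cstdint>
#include <cstring>

// Copy only the 10 significant bytes of an 80-bit x87 value:
// one 64-bit mantissa load/store plus one 16-bit sign:exponent
// load/store. The 6 padding bytes are left untouched.
void copy_m80(long double *dst, const long double *src) {
    std::uint64_t mantissa;
    std::uint16_t sign_exp;
    std::memcpy(&mantissa, src, sizeof(mantissa));
    std::memcpy(&sign_exp, reinterpret_cast<const char *>(src) + 8,
                sizeof(sign_exp));
    std::memcpy(dst, &mantissa, sizeof(mantissa));
    std::memcpy(reinterpret_cast<char *>(dst) + 8, &sign_exp,
                sizeof(sign_exp));
}
```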
Of course for throughput, and for latency in cases other than store-forwarding, `movdqa` seems like by far the best choice, because as you point out the ABI guarantees 16-byte alignment. A 16-byte store can forward to an `fld m80`.
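With intrinsics, that full 16-byte copy (value plus padding) could be written explicitly; a minimal sketch assuming the 16-byte alignment the ABI guarantees:

```cpp
#include <emmintrin.h>

// Copy all 16 bytes (10 value bytes + 6 padding) with one aligned
// SSE2 vector load and one aligned store, matching the movaps/movdqa
// code the compilers emit for the struct case.
void copy_m80_vec(long double *dst, const long double *src) {
    _mm_store_si128(reinterpret_cast<__m128i *>(dst),
                    _mm_load_si128(reinterpret_cast<const __m128i *>(src)));
}
```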
The same argument applies for copying `double` or `float` with integer vs. x87 instructions (e.g. in 32-bit code): `fld m32`/`fstp m32` has 1 cycle higher round-trip latency than SSE `movd`, and 2 cycles higher latency than integer `mov`, on Sandybridge-family CPUs. (Unlike PowerPC / Cell's load-hit-store, there's no penalty for store-forwarding from FP stores to integer loads. x86's strong memory-ordering model wouldn't allow separate store buffers for FP vs. integer, if that's what PPC does.)
Once the compiler realizes that it's not going to use any FP instructions on a `float` / `double` / `long double`, it should usually replace the load/store with non-x87 instructions. But copying a `double` or `float` with x87 is fine if integer / SSE register pressure is a problem.
Integer register pressure in 32-bit code is almost always high, and `-mfpmath=sse` is the default for 64-bit code. You could imagine rare cases where using x87 to copy a `double` in 64-bit code would be worth it, but compilers would be more likely to make things worse than better if they went looking for places to use x87. gcc has `-mfpmath=sse+387`, but it's not usually very good. (And that's not even considering physical-register-file pressure from using x87 + SSE. Hopefully an "empty" x87 state doesn't use any physical registers; `xsave` knows about parts of the architectural state being empty, so it can avoid saving them.)