I noticed a weird thing today. When copying a `long double`[1], all of gcc, clang and icc generate `fld` and `fstp` instructions, with `TBYTE` memory operands. That is, the following function:
```cpp
void copy_prim(long double *dst, long double *src) {
    *src = *dst;
}
```
Generates the following assembly:
```asm
copy_prim(long double*, long double*):
        fld     TBYTE PTR [rdi]
        fstp    TBYTE PTR [rsi]
        ret
```
Now according to Agner's tables this is a poor choice for performance: `fld` takes four uops (none fused) and `fstp` takes a whopping seven uops (none fused), versus, say, a single fused uop each for `movaps` to/from an `xmm` register.
Interestingly, clang starts using `movaps` as soon as you put the `long double` in a `struct`. The following code:
```cpp
struct long_double {
    long double x;
};

void copy_ld(long_double *dst, long_double *src) {
    *src = *dst;
}
```
Compiles to the same `fld`/`fstp` assembly as previously shown for gcc and icc, but clang now uses:
```asm
copy_ld(long double*, long double*):
        movaps  xmm0, xmmword ptr [rdi]
        movaps  xmmword ptr [rsi], xmm0
        ret
```
Oddly, if you stuff an additional `int` member into the `struct` (which doubles its size to 32 bytes due to alignment), all compilers generate SSE-only copy code:
```asm
copy_ldi(long_double_int*, long_double_int*):
        movdqa  xmm0, XMMWORD PTR [rdi]
        movaps  XMMWORD PTR [rsi], xmm0
        movdqa  xmm0, XMMWORD PTR [rdi+16]
        movaps  XMMWORD PTR [rsi+16], xmm0
        ret
```
Is there any functional reason to copy floating point values with `fld` and `fstp`, or is it just a missed optimization?
[1] Although a `long double` (i.e., an x86 extended-precision float) is nominally 10 bytes, on x86-64 it has `sizeof == 16` and `alignof == 16`, since alignments have to be a power of two and an object's size must be a multiple of its alignment (so that array elements stay aligned).
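These values can be verified at compile time with `static_assert` (a minimal sketch; the asserted values hold on x86-64 System V targets such as Linux, but not on 32-bit x86, where `sizeof(long double)` is typically 12 with 4-byte alignment):

```cpp
#include <cstddef>

// Compile-time checks of the x86-64 System V layout of long double:
// 10 significant bytes padded out to 16, with 16-byte alignment.
static_assert(sizeof(long double) == 16,
              "long double padded to 16 bytes on x86-64");
static_assert(alignof(long double) == 16,
              "long double is 16-byte aligned on x86-64");
```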
It looks like a big missed-optimization for code that needs to copy `long double` without processing it. `fstp m80`/`fld m80` have a round-trip latency of 8 cycles on Skylake, vs. 5 for `movdqa` store-forwarding from store to reload. More importantly, Agner lists `fstp m80` as one per 5 clocks throughput, so there's something non-pipelined going on!
The only possible benefit I can think of is store-forwarding from a still-in-flight `long double` store. Consider a data-dependency chain that involves some x87 math, a `long double` store, then your function, then a `long double` load and more x87 math. According to Agner's tables, `fld`/`fstp` will add 8 cycles, but `movdqa` will see a store-forwarding stall and add 5 + 11 cycles or so for slow-path store-forwarding.
Probably the lowest-latency strategy to copy an `m80` would be 64-bit + 16-bit integer `mov`/`movzx` load/store instructions. We know that `fstp m80` and `fld m80` use 2 separate store-data (port 4) or load (p23) uops, and I think we can assume it's broken up as a 64-bit mantissa and a 16-bit sign:exponent.
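A sketch of that strategy in C++ (hypothetical helper, not something any compiler emits today; assumes x86's little-endian layout with the mantissa in the low 8 bytes, and relies on the compiler lowering small fixed-size `memcpy` calls to plain integer loads/stores):

```cpp
#include <cstdint>
#include <cstring>

// Copy only the 10 significant bytes of an 80-bit x87 value:
// one 64-bit mantissa load/store plus one 16-bit sign:exponent
// load/store. The 6 padding bytes are left untouched.
void copy_m80(long double *dst, const long double *src) {
    std::uint64_t mantissa;
    std::uint16_t sign_exp;
    std::memcpy(&mantissa, src, sizeof(mantissa));
    std::memcpy(&sign_exp, reinterpret_cast<const char *>(src) + 8,
                sizeof(sign_exp));
    std::memcpy(dst, &mantissa, sizeof(mantissa));
    std::memcpy(reinterpret_cast<char *>(dst) + 8, &sign_exp,
                sizeof(sign_exp));
}
```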
Of course for throughput, and for latency in cases other than store-forwarding, `movdqa` seems like by far the best choice, because as you point out the ABI guarantees 16-byte alignment. A 16-byte store can forward to an `fld m80`.
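With intrinsics, that full 16-byte copy (value plus padding) could be written explicitly; a minimal sketch assuming the 16-byte alignment the ABI guarantees:

```cpp
#include <emmintrin.h>

// Copy all 16 bytes (10 value bytes + 6 padding) with one aligned
// SSE2 vector load and one aligned store, matching the movaps/movdqa
// code the compilers emit for the struct case.
void copy_m80_vec(long double *dst, const long double *src) {
    _mm_store_si128(reinterpret_cast<__m128i *>(dst),
                    _mm_load_si128(reinterpret_cast<const __m128i *>(src)));
}
```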
The same argument applies for copying `double` or `float` with integer vs. x87 instructions (e.g. in 32-bit code): `fld m32`/`fstp m32` has 1 cycle higher round-trip latency than SSE `movd`, and 2 cycles higher latency than integer `mov`, on Sandybridge-family CPUs. (Unlike PowerPC / Cell's load-hit-store, there's no penalty for store-forwarding from FP stores to integer loads. x86's strong memory-ordering model wouldn't allow separate store buffers for FP vs. integer, if that's what PPC does.)
Once the compiler realizes that it's not going to use any FP instructions on a `float` / `double` / `long double`, it should usually replace the load/store with non-x87 instructions. But copying a `double` or `float` with x87 is fine if integer / SSE register pressure is a problem.
Integer register pressure in 32-bit code is almost always high, and `-mfpmath=sse` is the default for 64-bit code. You could imagine rare cases where using x87 to copy a `double` in 64-bit code would be worth it, but compilers would be more likely to make things worse than better if they went looking for places to use x87. gcc has `-mfpmath=sse+387`, but it's not usually very good. (And that's not even considering physical-register-file pressure from using x87 + SSE. Hopefully an "empty" x87 state doesn't use any physical registers; `xsave` knows about parts of the architectural state being empty, so it can avoid saving them.)