For example with such function,
int fb(char a, char b, char c, char d) {
return (a + b) - (c + d);
}
gcc
's assembly output is,
fb:
movsx esi, sil
movsx edi, dil
movsx ecx, cl
movsx edx, dl
add edi, esi
add edx, ecx
mov eax, edi
sub eax, edx
ret
Vaguely, I understand that the purpose of movsx
is to remove the dependency from the previous value of the register, but honestly I still don't understand exactly what kind of dependency it is trying to remove. I mean, for example, whether or not there is movsx esi, sil
, if some value is being written to esi
, then any operation using esi
will have to wait, if a value is being read from esi
, any operation modifying the value of esi
will have to wait, and if esi
isn't being used by any operation, the code will continue to run. What difference does movsx
make? I cannot say the compiler is doing wrong because movsx
or movzx
is (almost?) always produced by any compiler whenever loading values smaller than 32-bits.
Apart from my lack of understanding, gcc
behaves differently with float
s.
float ff(float a, float b, float c, float d) {
return (a + b) - (c + d);
}
is compiled to,
ff:
addss xmm0, xmm1
addss xmm2, xmm3
subss xmm0, xmm2
ret
If the same logic was applied, I believe the output should be something like,
ff:
movd xmm0, xmm0
movd xmm1, xmm1
movd xmm2, xmm2
movd xmm3, xmm3
addss xmm0, xmm1
addss xmm2, xmm3
subss xmm0, xmm2
ret
So I'm actually asking 2 questions.
gcc
behave differently with float
s?movsx
make?The return value is the same width as the args so no extension is needed. The parts of registers outside the type widths are allowed to hold garbage in x86 and x86-64 calling conventions. (This applies to both GP integer and vector registers.)
Except for an undocumented extension which clang depends on, where callers extend narrow args to 32-bit; clang will skip the movsx
instructions in your char
example. https://godbolt.org/z/Gv5e4h3Eh
Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI? covers both the high garbage and the unofficial extension to the calling convention.
Since you asked about false dependencies, note that compilers do use movaps xmm,xmm
to copy a scalar. (e.g. in GCC's missed optimizations in (a-b) + (a-d)
we need to subtract from a
twice. It's non-commutative so we need a copy: https://godbolt.org/z/Tvx19raa3
Indeed, movss xmm1, xmm0
has a dependency on XMM1 where movaps
doesn't, and it would be a false dependency if you didn't actually care about merging with the old high bytes.
(Tuning for Pentium III or Pentium M might make sense to use movss
because it was single-uop there, but current GCC -O3 -m32 -mtune=pentium3 -mfpmath=sse
uses movaps, spending the 2nd uop to avoid a false dependency. It wasn't until Core2 that the SIMD execution units widened to 128-bit for P6 family, matching Pentium 4.)
C integer promotion rules mean that a+b
for narrow inputs is equivalent to (int)a + (int)b
. In all x86 / x86-64 ABIs, char
is a signed type (unlike on ARM for example), so it needs to be sign extended to int
width, not zero extended. And definitely not truncated.
If you truncated the result again by returning a char
, compilers could if they wanted just do 8-bit adds. But actually they'll use 32-bit adds and leave whatever high garbage there: https://godbolt.org/z/hGdbecPqv. It's not doing this for dep-breaking / performance, just correctness.
As far as performance, GCC's behaviour of reading the 32-bit reg for a char
is good if the caller wrote the full register (which the unofficial extension to the calling convention requires anyway), or on CPUs that don't rename low 8 separately from the rest of the reg (everything other than P6-family: SnB-family only renames high-8 regs, except for original Sandybridge itself. Why doesn't GCC use partial registers?)
PS: there's no such instruction as movd xmm0, xmm0
, only a different form of movq xmm0, xmm0
which yes would zero-extend the low 64 bits of an XMM register into the full reg.
If you want to see various compiler attempts to zero-extend the low dword, with/without SSE4.1 insertps
, look at asm for __m128 foo(float f) { return _mm_set_ss(f); }
in the Godbolt link above. e.g. with just SSE2, zero a register with pxor, then movss xmm1, xmm0
. Otherwise, insertps
or xor-zero and blendps
.