In the third chapter of Computer Systems: A Programmer's Perspective, an example program is given when talking about shift operations:
long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}
And its assembly code is as follows (reproducible with GCC 10.2 -O1 for x86-64 on the Godbolt compiler explorer; -O2 schedules the instructions in a different order but still uses movl to ECX):
shift_left4_rightn:
    endbr64
    movq    %rdi, %rax    # Get x
    salq    $4, %rax      # x <<= 4
    movl    %esi, %ecx    # Get n
    sarq    %cl, %rax     # x >>= n
    ret
I wonder why the instruction that gets n is movl %esi, %ecx instead of movq %rsi, %rcx, since n is a quad word.
On the other hand, movb %sil, %cl might be more suitable if optimization is considered, since the shift amount only uses the single-byte register %cl and the higher bits are all ignored.
As a result, I can't figure out the reason for using movl %esi, %ecx when dealing with a long integer.
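Yes, GCC realizes that the upper bits of the count are ignored by sar. Then movl is the natural consequence of applying two simple optimization rules: avoid 8-bit and 16-bit operand-size when writing a register (partial-register writes), and otherwise prefer 32-bit operand-size, which needs no REX prefix.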
Fun fact: even if the arg had been uint8_t, compilers would still hopefully use movl %esi, %ecx. You'd think reading a wider register when the arg value is only in SIL could create a partial-register stall, but an unofficial extension to the x86-64 System V calling convention is that callers should zero- or sign-extend narrow args to at least 32 bits. So we can assume the register was written with at least a 32-bit operation.
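To make the "upper bits are ignored" point concrete, here is a minimal C model of how the count reaches sarq. It is only a sketch: the function names (sar64_model, rcx_after_movq, and so on) are made up for illustration, and the model covers just the architectural result, not timing. It assumes the documented x86-64 behavior that a 64-bit shift masks the count in CL to its low 6 bits, that a 32-bit register write zero-extends into the full 64-bit register, and that an 8-bit write leaves the other bits of the destination unchanged.

```c
#include <stdint.h>

/* Model of `sarq %cl, %rax`: arithmetic right shift, count taken from CL
   and masked to its low 6 bits. */
int64_t sar64_model(int64_t value, uint8_t cl)
{
    unsigned count = cl & 63u;
    /* Portable arithmetic shift: never right-shift a negative value. */
    return value < 0 ? ~(~value >> count) : value >> count;
}

/* movq %rsi, %rcx: copies all 64 bits. */
uint64_t rcx_after_movq(uint64_t rsi) { return rsi; }

/* movl %esi, %ecx: copies the low 32 bits and zero-extends into RCX. */
uint64_t rcx_after_movl(uint64_t rsi) { return (uint32_t)rsi; }

/* movb %sil, %cl: replaces only the low byte, so the result depends on
   the old value of RCX (this merge is the partial-register write). */
uint64_t rcx_after_movb(uint64_t old_rcx, uint64_t rsi)
{
    return (old_rcx & ~(uint64_t)0xFF) | (rsi & 0xFF);
}

/* CL is the low byte of RCX. */
uint8_t cl_of(uint64_t rcx) { return (uint8_t)rcx; }
```

All three copies leave the same value in CL, so the shift result is identical; only movb produces a value that depends on the previous contents of RCX.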
The specific downsides of some other choices:
movq %rsi, %rcx - waste of a REX prefix (code-size downside).
movb %sil, %cl - writes a partial register, and still needs a REX prefix to access SIL.
movzbl %sil, %ecx - code size: 2-byte opcode, and needs a REX to read SIL. Also, AMD CPUs only do mov-elimination (zero latency) for movl / movq, not movzx.
movw %si, %cx - zero advantages, needs an operand-size prefix and writes a partial register.
movzwl %si, %ecx - Tied with movq for code-size, but defeats mov-elimination even on Intel CPUs.

Fun fact: if we pad with a dummy arg so n arrives in RDX, GCC still chooses movl %edx, %ecx, even though movb %dl, %cl is the same code-size (no REX needed to access DL). So yes, GCC is definitely avoiding byte operand-size.
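The code-size comparisons above can be tabulated. The sketch below lists one valid machine-code encoding for each candidate (the form you would typically see in a disassembly; x86 has alternative encodings of the same length for register-to-register mov). The struct and the encoding_length helper are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* One candidate instruction and its machine-code bytes. */
struct encoding {
    const char *asm_text;
    unsigned char bytes[4];
    size_t length;
};

const struct encoding candidates[] = {
    { "movl %esi, %ecx",   { 0x89, 0xF1 },             2 },
    { "movq %rsi, %rcx",   { 0x48, 0x89, 0xF1 },       3 }, /* 0x48: REX.W */
    { "movb %sil, %cl",    { 0x40, 0x88, 0xF1 },       3 }, /* 0x40: REX, needed for SIL */
    { "movzbl %sil, %ecx", { 0x40, 0x0F, 0xB6, 0xCE }, 4 }, /* REX + 2-byte opcode */
    { "movw %si, %cx",     { 0x66, 0x89, 0xF1 },       3 }, /* 0x66: operand-size prefix */
    { "movzwl %si, %ecx",  { 0x0F, 0xB7, 0xCE },       3 }, /* 2-byte opcode */
    { "movl %edx, %ecx",   { 0x89, 0xD1 },             2 },
    { "movb %dl, %cl",     { 0x88, 0xD1 },             2 }, /* DL needs no REX */
};

/* Return the encoded length of a candidate, or 0 if it is not in the table. */
size_t encoding_length(const char *asm_text)
{
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
        if (strcmp(candidates[i].asm_text, asm_text) == 0)
            return candidates[i].length;
    return 0;
}
```

With n in ESI, movl is the only 2-byte option; with n in EDX, movl and movb tie at 2 bytes, which is why GCC's choice there shows it is avoiding byte operand-size rather than just minimizing size.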
Fun fact 2: Clang unfortunately does waste a REX on movq, missing this optimization. https://godbolt.org/z/6GWhMd

But if we make the count arg unsigned char, clang and GCC do both use movl instead of movb, fortunately. https://godbolt.org/z/e95WP8
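If you want to try these experiments yourself, the variants could look roughly like the following. This is a sketch of what to paste into a compiler explorer, not necessarily the exact source behind the links above; shift_padded and shift_narrow_count are hypothetical names. In the x86-64 System V calling convention the first three integer args arrive in RDI, RSI and RDX, so the dummy parameter moves n from RSI to RDX.

```c
/* Original function from the question: n arrives in RSI. */
long shift_left4_rightn(long x, long n)
{
    x <<= 4;
    x >>= n;
    return x;
}

/* Hypothetical variant: the unused dummy arg takes RSI, so n arrives in RDX. */
long shift_padded(long x, long dummy, long n)
{
    (void)dummy;
    x <<= 4;
    x >>= n;
    return x;
}

/* Hypothetical variant: the count is a narrow type, passed in the low byte of ESI. */
long shift_narrow_count(long x, unsigned char n)
{
    x <<= 4;
    x >>= n;
    return x;
}
```

All three compute the same value for in-range inputs; the interesting part is which mov the compiler emits to get the count into CL at -O1 or -O2.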