int main()
{
00211000 push ebp
00211001 mov ebp,esp
00211003 sub esp,10h
char charVar1;
short shortVar1;
int intVar1;
long longVar1;
charVar1 = 11;
00211006 mov byte ptr [charVar1],0Bh
shortVar1 = 11;
0021100A mov eax,0Bh
0021100F mov word ptr [shortVar1],ax
intVar1 = 11;
00211013 mov dword ptr [intVar1],0Bh
longVar1 = 11;
0021101A mov dword ptr [longVar1],0Bh
}
The other data types are stored directly from an immediate; only the short goes through a register first. Why?
GCC does the same thing, using mov reg, imm32 / mov m16, reg instead of mov mem, imm16.
It's to avoid LCP stalls on Intel P6-family CPUs from 16-bit operand-size mov imm16.
An LCP (length changing prefix) stall occurs when a prefix changes the length of the rest of the instruction compared to the same machine code bytes without prefixes.
(mov word ptr [ebp - 8], 11 would involve a 66 prefix that makes the rest of the instruction 5 bytes (opcode + modrm + disp8 + imm16) instead of 7 (opcode + modrm + disp8 + imm32) for the same opcode / modrm.)
66 c7 45 f8 0b 00          mov WORD PTR [ebp-0x8],0xb
   c7 45 f8 0b 00 00 00    mov DWORD PTR [ebp-0x8],0xb
   ^
   opcode
This length change confuses the instruction-length finding stage (pre-decode), which runs before chunks of machine code are routed to the actual decoders. The pre-decoders are forced to back up and use a slower method that accounts for prefixes in the way they look at opcodes. (Parallel decode of x86 machine code is hard.) The penalty for this backup can be up to 11 cycles, depending on microarchitecture and alignment of the instruction, so it should be avoided if possible.
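To make the length arithmetic concrete, here is a minimal C sketch of the length calculation for just this one instruction form (opcode C7 /0 with a [reg + disp8] addressing mode and no SIB byte). The function name mov_c7_disp8_length is hypothetical, and this is nowhere near a real x86 length decoder; it only illustrates that the size of the immediate, and therefore where the next instruction starts, depends on whether a 66 prefix came first.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: returns the total length in
 * bytes of a 32-bit-mode "mov r/m, imm" (opcode C7 /0) instruction that uses
 * a [reg + disp8] addressing mode without a SIB byte, or 0 for any other
 * byte sequence. The caller must pass a buffer holding a whole instruction. */
size_t mov_c7_disp8_length(const uint8_t *code)
{
    size_t len = 0;
    int has_opsize_prefix = 0;

    /* Optional operand-size prefix: switches the immediate from 32 to 16 bits. */
    if (code[len] == 0x66) {
        has_opsize_prefix = 1;
        len++;
    }

    if (code[len] != 0xC7)
        return 0;                        /* not the opcode this sketch handles */
    len++;                               /* opcode byte */

    uint8_t modrm = code[len];
    if ((modrm >> 6) != 1 ||             /* mod must be 01: disp8 follows */
        (modrm & 7) == 4 ||              /* rm == 100 would mean a SIB byte */
        ((modrm >> 3) & 7) != 0)         /* reg field must be /0 for mov */
        return 0;
    len++;                               /* ModRM byte */
    len++;                               /* disp8 byte */

    /* The immediate size is the part that the 66 prefix changes. */
    len += has_opsize_prefix ? 2 : 4;
    return len;
}
```

Fed the two encodings shown above, this sketch yields 6 for 66 c7 45 f8 0b 00 and 7 for c7 45 f8 0b 00 00 00: the bytes from c7 onward are 5 or 7 bytes long depending only on the prefix.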
See "Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?" for lots of details on what a length-changing-prefix stall is, the performance effect of stalling the pre-decode stage in Intel P6 and SnB-family CPUs for a few cycles, and how Sandybridge-family (modern mainstream Intel) special-cases mov opcodes to avoid LCP stalls from 16-bit immediates.
mov doesn't have a problem on modern Intel: Sandybridge-family removed LCP stalls for mov specifically (they still exist for other instructions), so this tuning decision only helps Nehalem and earlier.
AFAIK, it's not a thing on Silvermont-family, nor on any AMD, so this is probably something MSVC and GCC should update for their tune=generic, since P6-family CPUs are less and less relevant these days. (And even if the latest dev versions of GCC / MSVC changed now, it would be another year or so before lots of software distributions / releases were built with a new compiler.)
clang doesn't do this optimization, and it's not a disaster even on old P6-family CPUs because most software doesn't use a lot of short / int16_t variables. (And the bottleneck isn't always the front-end; it's often cache misses.)
Storing to the stack at all for this function is of course due to not enabling optimization. Since those variables aren't volatile, they should be optimized away completely since nothing reads them later. When you want to make examples of asm output, don't write a main; write a function that has to have some side-effect, e.g. storing through a pointer, or use volatile.
void foo(short *p){
    volatile short x = 123;
    *p = 123;
}
Compiles with MSVC 19.14 -O2 (https://godbolt.org/z/eWhzhEsEa):
x$ = 8
p$ = 8
foo PROC ; COMDAT
mov eax, 123 ; 0000007bH
mov WORD PTR x$[rsp], ax
mov WORD PTR [rcx], ax
ret 0
foo ENDP
Or with GCC 11.2 -O3, which sucks even more, not CSEing/reusing the register constant:
foo:
mov eax, 123
mov edx, 123
mov WORD PTR [rsp-2], ax
mov WORD PTR [rdi], dx
ret
But we can see that this is an Intel tuning, since with -O3 -march=znver1 (AMD Zen 1) GCC stores the immediates directly:
foo:
mov WORD PTR [rsp-2], 123
mov WORD PTR [rdi], 123
ret
Unfortunately it still does the LCP-avoidance for mov with -march=skylake, so it doesn't know the full rules.
And if we use *p += 12345; (a number big enough to not fit in an imm8, which add allows unlike mov) instead of just =, ironically GCC then uses a length-changing prefix with -march=skylake (as does MSVC), creating a stall: add WORD PTR [rdi], 12345.
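For anyone who wants to reproduce this, here is a minimal sketch of the two test functions. The names store_const, add_big and fits_imm8 are made up for this example; fits_imm8 only encodes the range check that decides whether a sign-extended 8-bit immediate could be used, and is not meant to describe what any compiler literally runs.

```c
#include <stdbool.h>

/* Plain 16-bit store of a constant: the case where compilers tuned for
 * older Intel CPUs may go through a register to avoid mov m16, imm16. */
void store_const(short *p)
{
    *p = 123;
}

/* Read-modify-write with a constant too large for a sign-extended imm8,
 * so a memory-destination 16-bit add needs a 16-bit immediate. */
void add_big(short *p)
{
    *p += 12345;
}

/* Hypothetical helper: true if value is representable as a sign-extended
 * 8-bit immediate (the imm8 forms that add has and mov lacks). */
bool fits_imm8(int value)
{
    return value >= -128 && value <= 127;
}
```

Compile with optimization enabled and the -march option you care about, then compare the asm for the two functions. A constant like 123 passes fits_imm8, so a 16-bit add could use the imm8 encoding, where the 66 prefix doesn't change the instruction length; 12345 does not, which is what forces the imm16 form.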