c assembly x86 cpu-architecture micro-optimization

Why does the short (16-bit) variable mov a value to a register and store that, unlike other widths?

int main()
{
00211000  push        ebp  
00211001  mov         ebp,esp  
00211003  sub         esp,10h  
    char charVar1;
    short shortVar1;
    int intVar1;
    long longVar1;
    
    charVar1 = 11;
00211006  mov         byte ptr [charVar1],0Bh  

    shortVar1 = 11;
0021100A  mov         eax,0Bh  
0021100F  mov         word ptr [shortVar1],ax  

    intVar1 = 11;
00211013  mov         dword ptr [intVar1],0Bh 
 
    longVar1 = 11;
0021101A  mov         dword ptr [longVar1],0Bh  
}

The other data types are stored directly from an immediate, but the short goes through a register first. Why is that?

Solution

GCC does the same thing, using mov reg, imm32 / mov m16, reg instead of mov mem, imm16.

It's to avoid LCP stalls on Intel P6-family CPUs from 16-bit operand-size mov imm16.

An LCP (length changing prefix) stall occurs when a prefix changes the length of the rest of the instruction compared to the same machine code bytes without prefixes.

mov word ptr [ebp - 8], 11 would involve a 66 prefix that makes the rest of the instruction 5 bytes (opcode + modrm + disp8 + imm16) instead of 7 (opcode + modrm + disp8 + imm32) for the same opcode / modrm.

 66 c7 45 f8 0b 00          mov     WORD PTR [ebp-0x8],0xb
    c7 45 f8 0b 00 00 00    mov    DWORD PTR [ebp-0x8],0xb
    ^
  opcode

This length change confuses the instruction-length finding stage (pre-decode), which happens before chunks of machine code are routed to the actual decoders. The pre-decoders are forced to back up and use a slower method that accounts for prefixes in the way they look at opcodes. (Parallel decode of x86 machine code is hard.) The penalty for this backup can be up to 11 cycles, depending on the microarchitecture and the alignment of the instruction, and should be avoided if possible.
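As a rough illustration of why the prefix matters for length-finding, here is a minimal C sketch that computes the length of just this one instruction form (opcode C7 with a [reg + disp8] operand) in 32-bit mode. The function name mov_c7_disp8_length is made up for this example, and real hardware finds lengths for many bytes in parallel rather than walking one instruction like this. The point is only that the immediate size, and so where the next instruction starts, depends on a prefix byte that comes before the opcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: length in bytes of
 * `mov r/m, imm` (opcode C7) with a [reg + disp8] operand, assuming
 * 32-bit mode (default operand size 32). Returns 0 for any other form. */
static size_t mov_c7_disp8_length(const uint8_t *code)
{
    size_t len = 0;
    int has_opsize_prefix = 0;

    if (code[len] == 0x66) {        /* operand-size prefix */
        has_opsize_prefix = 1;
        len++;
    }
    if (code[len] != 0xC7)          /* only this opcode is handled */
        return 0;
    len++;

    uint8_t modrm = code[len];
    /* require mod = 01 (disp8 follows) and rm != 100 (no SIB byte) */
    if ((modrm >> 6) != 1 || (modrm & 7) == 4)
        return 0;
    len += 2;                       /* modrm + disp8 */

    /* The length-changing part: the prefix shrinks imm32 to imm16. */
    len += has_opsize_prefix ? 2 : 4;
    return len;
}
```

For the two encodings shown above, this gives 6 bytes with the 66 prefix and 7 bytes without it, even though the opcode and modrm bytes are identical.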

See "Does a Length-Changing Prefix (LCP) incur a stall on a simple x86_64 instruction?" for lots of details on what a Length Changing Prefix stall is, the performance effect of stalling the pre-decode stage for a few cycles on Intel P6 and SnB-family CPUs, and how Sandybridge-family (modern mainstream Intel) special-cases mov opcodes to avoid LCP stalls from 16-bit immediates.

`mov` specifically doesn't have a problem on modern Intel

Sandybridge-family removed LCP stalls for mov specifically (they still exist for other instructions), so this tuning decision only helps Nehalem and earlier.

AFAIK, it's not a thing on Silvermont-family, nor on any AMD, so this is probably something MSVC and GCC should update for their tune=generic since P6-family CPUs are less and less relevant these days. (And if latest dev versions of GCC / MSVC changed now, it would be another year or so before lots of software distributions / releases would be built with a new compiler.)

clang doesn't do this optimization, and it's not a disaster even on old P6-family CPUs because most software doesn't use a lot of short / int16_t variables. (And the bottleneck isn't always the front-end, often cache misses.)

Examples

Storing to the stack at all for this function is of course due to not enabling optimization. Since those variables aren't volatile, they should be optimized away completely since nothing reads them later. When you want to make examples of asm output, don't write a main, write a function that has to have some side-effect, e.g. storing through a pointer, or use volatile.

void foo(short *p){
    volatile short x = 123;
    *p = 123;
}

Compiles with MSVC 19.14 -O2 (https://godbolt.org/z/eWhzhEsEa):

x$ = 8
p$ = 8
foo     PROC                                          ; COMDAT
        mov     eax, 123                      ; 0000007bH
        mov     WORD PTR x$[rsp], ax
        mov     WORD PTR [rcx], ax
        ret     0
foo     ENDP

Or with GCC 11.2 -O3, which sucks even more, not CSEing / reusing the register constant:

foo:
        mov     eax, 123
        mov     edx, 123
        mov     WORD PTR [rsp-2], ax
        mov     WORD PTR [rdi], dx
        ret

But we can see that this is an Intel tuning since with -O3 -march=znver1 (AMD Zen 1):

foo:
        mov     WORD PTR [rsp-2], 123
        mov     WORD PTR [rdi], 123
        ret

Unfortunately it still does the LCP-avoidance for mov with -march=skylake, so it doesn't know the full rules.

And if we use *p += 12345; (a number big enough to not fit in an imm8, which add allows unlike mov) instead of just =, ironically GCC then uses a length-changing-prefix with -march=skylake (as does MSVC), creating a stall: add WORD PTR [rdi], 12345.
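For reference, a sketch of the source for that variant. The function name bar is just a placeholder for this example; only the *p += 12345; statement comes from the text above.

```c
/* Placeholder name; same shape as foo above, but a read-modify-write
 * with a constant too large for a sign-extended imm8, so the
 * memory-destination add needs a 16-bit immediate. */
void bar(short *p)
{
    *p += 12345;
}
```

Compiling this with the same options as the earlier examples should let you check whether your compiler version emits the memory-destination add with an imm16 or goes through a register instead.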
