Tags: c, gcc, x86-64, compiler-optimization, micro-optimization

Is this a missed optimization in GCC, loading a 16-bit integer value from .rodata instead of using an immediate store?


For this code:

#include <stdint.h>

extern struct __attribute__((packed))
{
    uint8_t size;
    uint8_t pad;
    uint16_t sec_num;
    uint16_t offset;
    uint16_t segment;
    uint64_t sec_id;
} ldap;

//uint16_t x __attribute__((aligned(4096)));

void kkk()
{
    ldap.size = 16;
    ldap.pad = 0;
    //x = 16;
}

After compiling it with -O2, -O3 or -Ofast, the output is:

    .globl  kkk
    .type   kkk, @function
kkk:
    movzwl  .LC0(%rip), %eax
    movw    %ax, ldap(%rip)
    ret
    .size   kkk, .-kkk
    .section    .rodata.cst2,"aM",@progbits,2
    .align 2
.LC0:
    .byte   16
    .byte   0

I think the best would be:

kkk:
    movw    $16, ldap(%rip)
    ret

and this is also OK:

kkk:
    movl    $16, %eax
    movw    %ax, ldap(%rip)
    ret

But I really don't know what the .rodata constant .LC0 is doing there.

I'm using GCC 12.2, installed via apt on Ubuntu 22.10.


Solution

  • Near duplicate, I thought I'd already answered this but didn't find the Q&A right away: Why does the short (16-bit) variable mov a value to a register and store that, unlike other widths?

    This question also has a separate missed optimization on top of that one: here it's two 8-bit assignments getting coalesced, not a single 16-bit integer store. Also, update: this GCC12 regression is already fixed in GCC trunk; sorry, I forgot to check that before suggesting that you report it upstream.


    It's avoiding Length-Changing Prefix (LCP) stalls for 16-bit immediates, but those don't exist for mov on Sandybridge and later, so it should stop doing that for tune=generic :P You're correct: movw $16, ldap(%rip) would be optimal. That's what GCC uses when tuning for non-Intel uarches like -mtune=znver3. Or at least that's what older versions did; they didn't have the other missed optimization of loading from .rodata.

    It's insane that it's loading a 16 from the .rodata section instead of using it as an immediate. The movzwl load with a RIP-relative addressing mode is already as large as mov $16, %eax, so you're correct about that. (.rodata is the section where GCC puts string literals, const variables whose address is taken or otherwise can't be optimized away, etc. Also floating-point constants; loading a constant from memory is normal for FP/SIMD since x86 lacks a mov-immediate to XMM registers, but it's rare even for 8-byte integer constants.)
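
    For example, here's a minimal sketch of the FP case, where a .rodata load is the normal choice (the function name store_fp is mine; the asm shown is what gcc -O2 typically produces, though exact code-gen varies by version and tuning):

    void store_fp(double *p)
    {
        *p = 1.5;    // nonzero FP constant: x86-64 has no mov-immediate to XMM
    }

    # gcc -O2, typically:
    store_fp:
            movsd   .LC0(%rip), %xmm0   # load the 1.5 bit-pattern from .rodata
            movsd   %xmm0, (%rdi)
            ret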

    GCC11 and earlier did mov $16, %eax / movw %ax, ldap(%rip) (https://godbolt.org/z/7qrafWhqd), so that's a GCC12 regression you should report on https://gcc.gnu.org/bugzilla


    Loading from .rodata doesn't happen with x = 16 alone (https://godbolt.org/z/ffnjnxjWG). Presumably some interaction with coalescing two separate 8-bit stores into a 16-bit store trips up GCC.

    uint16_t x __attribute__((aligned(4096)));
    void store()
    {
        //ldap.size = 16;
        //ldap.pad = 0;
        x = 16;
    }
    
    # gcc12 -O3 -mtune=znver3
    store:
            movw    $16, x(%rip)
            ret
    

    Or with the default tune=generic, GCC12 matches GCC11 code-gen.

    store:
            movl    $16, %eax
            movw    %ax, x(%rip)
            ret
    

    This is optimal for Core 2 through Nehalem (Intel P6-family CPUs that support 64-bit mode, which is necessary for them to be running this code in the first place.) Those are obsolete enough that it's maybe time for current GCC to stop spending extra code-size and instructions and just mov-immediate to memory, since mov imm16 opcodes specifically get special support in the pre-decoders to avoid an LCP stall, where there would be one with add $1234, x(%rip). See https://agner.org/optimize/, specifically his microarch PDF. (add sign_extended_imm8 exists, mov unfortunately doesn't, so add $16, %ax wouldn't cause a problem, but $1234 would.)
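
    To make the imm16 special case concrete, here are the encodings involved (hand-assembled bytes in the comments; <rel32> stands for the 4 displacement bytes):

    movw    $0x1234, x(%rip)   # 66 C7 05 <rel32> 34 12 - the 66 prefix changes
                               # the immediate from 4 bytes to 2, but SnB and
                               # later pre-decoders special-case mov: no stall
    addw    $0x1234, x(%rip)   # 66 81 05 <rel32> 34 12 - same length change,
                               # no special case: LCP stall on CPUs that have them
    addw    $16, x(%rip)       # 66 83 05 <rel32> 10    - sign-extended imm8,
                               # no length change, no stall; mov lacks this form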

    But since those old CPUs don't have a uop cache, an LCP stall in an inner loop could make things much slower in the worst case. So it's maybe worth making somewhat slower code for all modern CPUs in order to avoid that big pothole on the slowest CPUs.

    Unfortunately GCC doesn't know that SnB fixed LCP stalls on mov: -O3 -march=haswell still does a 32-bit mov-immediate to a register first. So -march=native on modern Intel CPUs will still make slower code :/

    -O3 -march=alderlake does use mov-imm16; perhaps they updated the tuning for it because it also has E-cores, which are Silvermont-family.
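
    Until the fix lands in a release you can install, one possible source-level workaround (a sketch of my own; it hard-codes the little-endian layout of size/pad) is to do the 16-bit store yourself instead of letting GCC coalesce two 8-bit stores:

    #include <stdint.h>
    #include <string.h>

    extern struct __attribute__((packed))
    {
        uint8_t size;
        uint8_t pad;
        uint16_t sec_num;
        uint16_t offset;
        uint16_t segment;
        uint64_t sec_id;
    } ldap;

    void kkk(void)
    {
        uint16_t both = 16;   // low byte = size = 16, high byte = pad = 0 (little-endian)
        memcpy(&ldap.size, &both, sizeof both);   // GCC folds a 2-byte memcpy into one 16-bit store
    }

    Since this is a single 16-bit store to begin with, it should presumably get the same code-gen as the x = 16 case above, rather than the .rodata load.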