Tags: c, gcc, struct, compiler-optimization, micro-optimization

Passing a struct with constant data to a function in C


I have code with a loop that calls a function with a compile-time-constant structure. I am aware that one must use a compound literal in order to have an anonymous struct (or array) literal.

Since a compound literal has the lifetime of its enclosing block, I was concerned that using one might cause the struct to be unnecessarily recreated and destroyed on every loop iteration. To avoid this, I moved the struct into a variable outside the loop.

struct Struct {
    int x, y, z;
};
extern void func(struct Struct);
void test1(void) {
    for (int i = 1000000; i--;)
        func((struct Struct){1, 2, 3});
}
void test0(void) {
    const struct Struct s = {1, 2, 3};
    for (int i = 1000000; i--;)
        func(s);
}

I tested the above via Godbolt, and as expected, test1 seems to construct the data inside the loop. I also expected an optimizing compiler to identify this construct and optimize accordingly, so I added -Os, but the functions still differ in assembly, despite having identical behavior.

I was told that an optimizing compiler should generate optimal (and thus identical) code for constructs with identical behavior.

Which option is actually more likely to offer better performance?


Solution

  • Which option is actually more likely to offer better performance?

It depends on missed-optimization bugs in specific versions of different compilers. There is no general answer; compilers should be generating identical asm for both.

And it's silly for GCC to be loading that constant data from memory instead of using mov esi, 3 inside the loop. mov r12, 0x200000001 outside the loop makes more sense than loading a constant, too. The first version looks good with gcc -Os for x86-64. With -O3 it gets worse, but is still not as bad as the way GCC compiles the second version, where it loads both values from .rodata.

    When you find different asm for equivalent source, usually it's either a missed optimization or two different but valid tuning choices. In this case loading from memory is a missed optimization bug in GCC, which you should report (https://gcc.gnu.org/bugzilla/, use the keyword [missed-optimization]).

    Clang gets it right in both cases: https://godbolt.org/z/TP1Mdx6vn

    test1:                                  # @test1
            push    rbp
            push    rbx
            push    rax       # dummy push to align the stack
            mov     ebp, -1000000
            movabs  rbx, 8589934593
    .LBB0_1:                               # =>This Inner Loop Header: Depth=1
            mov     rdi, rbx
            mov     esi, 3
            call    func@PLT
            inc     ebp
            jne     .LBB0_1            # }while(++i != 0)
    
            add     rsp, 8
            pop     rbx
            pop     rbp
            ret
    

    It makes some sense to hoist the 10-byte mov r64, imm64 instruction out of the loop, like clang and GCC have been doing. It can take an extra cycle to fetch from the uop cache on Sandybridge-family, according to Agner Fog's microarch guide. The loop runs many iterations, plenty to amortize the extra cost of saving/restoring another call-preserved register.

    But mov esi, 3 is only a 5-byte instruction, plenty cheap enough to just leave it in the loop, even if func is actually very short so this turns out to be a pretty tight loop.


    When GCC or clang want a 64-bit constant in an integer register, they normally use mov-immediate. Something about the 64-bit value being 2 struct members seems to be confusing GCC into loading it from .rodata in some cases.

    But with -Os (optimize for size) it's able to avoid that, which is kind of surprising. It's true that a 10-byte mov-immediate is smaller total size than 8 bytes of data plus a 7-byte mov r64, [rip+rel32] (REX + opcode + ModRM + rel32).

But if it's able to see a mov-immediate as an option at all, I don't know why it wouldn't pick it for performance, given that GCC normally does exactly that when calling something like int foo(uint64_t, int).

    Compilers are very complex pieces of machinery, and work by transforming internal representations of the program logic. Often they arrive at the same efficient result from different starting points, but not always.