c++ optimization language-lawyer

Why do C++ compilers not forward loads of consts across function calls?


Consider the following example code:

void black_box(int* foo);

int foo(int x) {
    black_box(&x);
    return x;
}

int bar(const int x) {
    black_box(const_cast<int*>(&x));
    return x;
}

Quoting cppreference:

A const object is

  • an object whose type is const-qualified, or
  • [...]

Such object cannot be modified: attempt to do so [...] indirectly (e.g., by modifying the const object through a reference or pointer to non-const type) results in undefined behavior.
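
To make the quoted rule concrete, here is a minimal sketch of the case where casting away const is still well-defined, because the underlying object is not const. The names overwrite, read_after_call and demo_non_const are hypothetical and only serve the illustration; overwrite stands in for a black_box that actually writes.

```cpp
// Hypothetical helper standing in for black_box: it writes through the pointer.
void overwrite(int* p) { *p = 42; }

// The parameter is a reference to const, but the object it refers to
// may or may not be a const object.
int read_after_call(const int& r) {
    overwrite(const_cast<int*>(&r));  // well-defined only if the referent is non-const
    return r;                         // has to be re-read: the referent may have changed
}

// Well-defined: y is not a const object, so writing to it through the
// cast pointer is allowed, and the new value is observed.
int demo_non_const() {
    int y = 1;
    return read_after_call(y);
}

// By contrast, `const int z = 1; read_after_call(z);` would be undefined
// behavior, because z itself is a const object.
```

The rule is about the object being const, not about the pointer or reference type used to reach it, which is why only a const-declared object such as the x in bar could justify skipping the reload.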

The x in bar is declared const. The compiler should be allowed to assume that black_box will not change it. Despite that, the generated x86 assembly for foo and bar is the same, for both gcc and clang: https://godbolt.org/z/jGsKo56aG

push    rax
mov     dword ptr [rsp + 4], edi
lea     rdi, [rsp + 4]
call    black_box(int*)@PLT
mov     eax, dword ptr [rsp + 4]
pop     rcx
ret

For bar, why does the compiler not move x into a callee-saved register and avoid the load from memory after the function call?


Solution

  • For bar, why does the compiler not move x into a callee-saved register and avoid the load from memory after the function call?

    As it so often happens, some ideas sound good until you actually try them. Thanks to @Brian Bi for prompting me to do just that. As it turns out, compilers are usually pretty good at their job.

    If we actually try what I suggested, we end up with something like the following assembly:

    push    rbx                       ; rbx is callee-saved, so bar has to preserve it
    sub     rsp, 16                   ; reserve a stack slot for `x`, keep the stack aligned
    mov     ebx, edi                  ; copy `x` (edi) into callee-saved register (ebx)
    mov     dword ptr [rsp + 12], edi
    lea     rdi, [rsp + 12]
    call    black_box(int*)@PLT
    mov     eax, ebx                  ; move `x` into return value register (eax)
    add     rsp, 16
    pop     rbx
    ret
    

    While this avoids the load from the stack, it needs additional instructions. x cannot stay in edi or eax across the call, because those registers are caller-saved and may be clobbered, so it has to be copied into ebx before the call and into eax afterwards. And because rbx is callee-saved, bar itself has to save and restore it. The compiler's version avoids all of that, because it just reloads the value from the stack. At this point, it is unclear to me whether this would be faster or slower in practice.
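
    For anyone who wants to experiment, the same forwarding can be requested at the source level by copying x before the call. This is only a sketch: bar_forwarded and last_seen are hypothetical names, and black_box is given a trivial read-only body here so that the snippet is self-contained. For a real comparison on Compiler Explorer, black_box should only be declared, as in the question, so that the call stays opaque.

```cpp
// Hypothetical stand-in for the question's opaque black_box; it only reads *p.
static int last_seen = 0;
void black_box(int* p) { last_seen = *p; }

// Source-level version of the forwarding idea: copy x before the call and
// return the copy, so the value of x is not needed again after the call.
int bar_forwarded(const int x) {
    const int saved = x;                // must survive the call somewhere else
    black_box(const_cast<int*>(&x));    // the address of x still escapes
    return saved;                       // no reload from x's stack slot
}
```

    Since saved never has its address taken, the compiler is free to keep it in a register across the call, which should produce code along the lines of the hand-written version above.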

    The optimal assembly is probably:

    push    rdi
    mov     rdi, rsp
    call    black_box(int*)@PLT
    pop     rax
    ret
    

    But compilers have never really been good at these kinds of stack frame layout optimizations.