Tags: c++, pass-by-reference, pass-by-value, micro-optimization

Passing lightweight objects as `const&`


Given the following pet snippet:

#include <utility>

template<class T1, class T2>
struct my_pair { /* constructors and such */ };

template<class T1, class T2>
auto f(std::pair<T1, T2> const& p) // (1)
{ return my_pair<T1, T2>(p.first, p.second); }

template<class T1, class T2>
auto f(std::pair<T1, T2> p)        // (2)
{ return my_pair<T1, T2>(p.first, p.second); }

If I know that both T1 and T2 are lightweight objects whose copy time is negligible (a couple of pointers each, for instance), is it better to pass the std::pair by copy than by reference? I ask because I know that sometimes it is better to let the compiler elide copies than to force it to deal with references (for example, to optimize chains of copies).

The same question applies to my_pair's constructors: is it better to have them receive copies rather than references?
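
For concreteness (the snippet above only says "constructors and such"), the two alternatives I have in mind look roughly like this; it's just a sketch, and the names my_pair_ref and my_pair_val are made up for illustration:

template<class T1, class T2>
struct my_pair_ref {
    T1 first; T2 second;
    // (a) take the arguments by const reference
    my_pair_ref(T1 const& f, T2 const& s) : first(f), second(s) {}
};

template<class T1, class T2>
struct my_pair_val {
    T1 first; T2 second;
    // (b) take the arguments by value and move them into the members
    // (std::move comes from <utility>, included above)
    my_pair_val(T1 f, T2 s) : first(std::move(f)), second(std::move(s)) {}
};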

The calling context is unknown, but the object generators and the class constructors themselves are all inline functions, so maybe the difference between references and values doesn't matter, because the optimizer can see the final target and construct the object at the end of the road (I'm just speculating). In that case the object generators would be pure zero-overhead abstractions, and I think the reference would be better, just in case some exceptional pair is bigger than usual.

But if that's not the case (if references always or usually have some cost compared to copies, even when everything is inlined), then I will go for the copies.


Solution

  • Outside of the domain of micro-optimization, I would generally pass a const reference, since you aren't modifying the object and you'd like to avoid the copy. If one day you do use a T1 or T2 that is expensive to copy, the copy is potentially a big issue: there is no equivalently big footgun with passing a const reference. So I look at passing by value as a choice with very asymmetric tradeoffs and choose by value only when I know the data is small.

    As to your specific micro-optimization question, it basically depends on whether the call gets fully inlined and on whether your compiler is decent.

    Full Inlining

    If either variant of your f function gets inlined into the caller, and optimization is enabled, you are likely to get identical or nearly identical code for either variant. I test that here with the inline_f_ref and inline_f_val calls. They both generate a pair from an unknown external function and then call either the by-reference or the by-value variant of f.

    Like this for inline_f_val (the inline_f_ref version only changes the call at the end):

    template <typename T>
    auto inline_f_val() {
        auto pair = get_pair<T>();
        return f_val(pair);
    }
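
    The rest of the harness looks something like this (a sketch: get_pair is deliberately left as an opaque external function, and f_ref / f_val are the two variants of f from the question, assuming my_pair has a (T1, T2) constructor):

    // Opaque: defined in another translation unit, so its result is unknown here.
    template <typename T>
    std::pair<T, T> get_pair();

    template <typename T1, typename T2>
    auto f_ref(std::pair<T1, T2> const& p)   // by const reference
    { return my_pair<T1, T2>(p.first, p.second); }

    template <typename T1, typename T2>
    auto f_val(std::pair<T1, T2> p)          // by value
    { return my_pair<T1, T2>(p.first, p.second); }

    template <typename T>
    auto inline_f_ref() {
        auto pair = get_pair<T>();
        return f_ref(pair);
    }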
    

    Here are the results on gcc when T1 and T2 are int:

    auto inline_f_ref<int>():
            sub     rsp, 8
            call    std::pair<int, int> get_pair<int>()
            add     rsp, 8
            ret
    
    auto inline_f_val<int>():
            sub     rsp, 8
            call    std::pair<int, int> get_pair<int>()
            add     rsp, 8
            ret
    

    Totally identical. The compiler sees right through the functions and even recognizes that std::pair and my_pair have the same layout, so all trace of f disappears.
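
    (The twop type used in the remaining listings is just a small structure holding two pointers, presumably along these lines:)

    // Two pointers, so 16 bytes per object on a 64-bit target; a std::pair<twop, twop>
    // is therefore 32 bytes, which matches the 32-byte copies in the listings below.
    struct twop {
        void* a;
        void* b;
    };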

    Here's a version with T1 and T2 being a structure with two pointers, instead:

    auto inline_f_ref<twop>():
            push    r12
            mov     r12, rdi
            sub     rsp, 32
            mov     rdi, rsp
            call    std::pair<twop, twop> get_pair<twop>()
            mov     rax, QWORD PTR [rsp]
            mov     QWORD PTR [r12], rax
            mov     rax, QWORD PTR [rsp+8]
            mov     QWORD PTR [r12+8], rax
            mov     rax, QWORD PTR [rsp+16]
            mov     QWORD PTR [r12+16], rax
            mov     rax, QWORD PTR [rsp+24]
            mov     QWORD PTR [r12+24], rax
            add     rsp, 32
            mov     rax, r12
            pop     r12
            ret
    

    That's the "ref" version, and again the "val" version is identical. Here the compiler isn't able to optimize away all of the work: it still copies the std::pair contents into the my_pair object after creating the pair (there are 4 stores, writing a total of 32 bytes, i.e., 4 pointers). So inlining again lets the compiler optimize both versions down to the same thing.

    You can probably find cases where that's not the case, but they are unusual in my experience.

    Without Inlining

    Without inlining it's a different story. You mention that all your functions are inline, but that doesn't necessarily mean the compiler will inline them. gcc in particular is more reluctant than average to inline functions (e.g., it didn't inline the very short functions in this example at -O2 without the inline keyword).
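
    (If you want to be sure you're looking at the out-of-line code when inspecting the output, one option is to suppress inlining explicitly; the spellings below are compiler-specific, and this is just a sketch:)

    // Keep a function out-of-line for codegen inspection (GCC/Clang and MSVC spellings).
    #if defined(_MSC_VER)
    #  define NO_INLINE __declspec(noinline)
    #else
    #  define NO_INLINE __attribute__((noinline))
    #endif

    template <typename T1, typename T2>
    NO_INLINE auto f_ref(std::pair<T1, T2> const& p)
    { return my_pair<T1, T2>(p.first, p.second); }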

    Without inlining the way parameters are passed and returned is set by the ABI, so the compiler cannot optimize away the differences between the two versions. The const reference version amounts to passing a pointer, so regardless of T1 and T2 you'll pass a pointer to the std::pair object in the first integer register.

    Here's the code this leads to when T1 and T2 are int, with gcc on Linux:

    auto f_ref<int, int>(std::pair<int, int> const&):
            mov     rax, QWORD PTR [rdi]
            ret
    

    The pointer to the std::pair is passed in rdi, so the body of the function is a single 8-byte move from that location into rax. A std::pair<int, int> takes 8 bytes, so the compiler copies the whole thing in one shot. In this case the return value is itself returned "by value" in rax, so we are done.
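
    (The size arithmetic is easy to check; on a typical 64-bit target where int is 4 bytes:)

    #include <utility>

    // Two 4-byte ints with no padding: the whole pair fits in one 8-byte register.
    static_assert(sizeof(std::pair<int, int>) == 2 * sizeof(int), "no padding expected");
    static_assert(sizeof(std::pair<int, int>) <= 8, "fits in a single 64-bit register");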

    This is dependent on both the compiler's ability to optimize and on the ABI. For example, here's the same function compiled by MSVC for a 64-bit Windows target:

    my_pair<int,int> f_ref<int,int>(std::pair<int,int> const &) PROC ; f_ref<int,int>, COMDAT
            mov     eax, DWORD PTR [rdx]
            mov     r8d, DWORD PTR [rdx+4]
            mov     DWORD PTR [rcx], eax
            mov     rax, rcx
            mov     DWORD PTR [rcx+4], r8d
            ret     0
    

    There are two different things happening here. First, the ABI is different: MSVC cannot return my_pair<int,int> in rax. Instead, the caller passes in rcx a pointer to a location where the callee should store the result, so this function has stores in addition to loads, and rax is loaded with the address of the stored data. Second, the compiler is too dumb to combine the two adjacent 4-byte loads and stores into 8-byte ones, so there are two loads and two stores.

    The second part could be fixed by a better compiler, but the first one is a consequence of the ABI.

    Here's the by value version of this function, in gcc on Linux:

    auto f_val<int, int>(std::pair<int, int>):
            mov     rax, rdi
            ret
    

    Still only a single instruction, but this time a single reg-reg move, which is never more expensive than a load, and usually considerably cheaper.

    On MSVC, 64-bit Windows:

    my_pair<int,int> f_val<int,int>(std::pair<int,int>)
            mov     rax, rdx
            mov     DWORD PTR [rcx], edx
            shr     rax, 32                             ; 00000020H
            mov     DWORD PTR [rcx+4], eax
            mov     rax, rcx
            ret     0
    

    You still have two stores, because the ABI still forces the value to be returned in memory, but the loads are gone because the MSVC 64-bit ABI allows arguments up to 64 bits in size to be passed in a register.

    Then the compiler goes and does a really dumb thing: starting with the 64 bits of the std::pair in rax, it writes out the bottom 32 bits, shifts the top 32 bits down to the bottom, and then writes those out. The world's slowest way to simply write out 64 bits. Still, this code will generally be faster than the by-reference version.

    In both ABIs the by-value function was able to pass its argument in a register. This has its limits, however. Here's the by-reference version of f when T1 and T2 are twop (the structure containing two pointers), with gcc on Linux:

    auto f_ref<twop, twop>(std::pair<twop, twop> const&):
            mov     rax, rdi
            mov     r8, QWORD PTR [rsi]
            mov     rdi, QWORD PTR [rsi+8]
            mov     rcx, QWORD PTR [rsi+16]
            mov     rdx, QWORD PTR [rsi+24]
            mov     QWORD PTR [rax], r8
            mov     QWORD PTR [rax+8], rdi
            mov     QWORD PTR [rax+16], rcx
            mov     QWORD PTR [rax+24], rdx
    

    Here's the by-value version:

    auto f_val<twop, twop>(std::pair<twop, twop>):
            mov     rdx, QWORD PTR [rsp+8]
            mov     rax, rdi
            mov     QWORD PTR [rdi], rdx
            mov     rdx, QWORD PTR [rsp+16]
            mov     QWORD PTR [rdi+8], rdx
            mov     rdx, QWORD PTR [rsp+24]
            mov     QWORD PTR [rdi+16], rdx
            mov     rdx, QWORD PTR [rsp+32]
            mov     QWORD PTR [rdi+24], rdx
    

    Although the loads and stores are ordered differently, both are doing exactly the same thing: 4 loads and 4 stores, copying 32 bytes from input to output. The only real difference is that in the by-value case the object is expected on the stack (hence we copy from [rsp]), while in the by-reference case the object is pointed to by the first argument, hence we copy from [rsi]1.

    So there is a smallish window where non-inlined by-value functions have an advantage over pass-by-reference: the window where their arguments can be passed in registers. For the Sys V ABI this usually applies to structures up to 16 bytes, and for the Windows x86-64 ABI up to 8 bytes. There are other restrictions as well, so not all objects of this size are always passed in registers.
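
    (As a rough illustration of that window; the type names below are made up:)

    #include <cstddef>

    // 16 bytes: small enough that the Sys V ABI can pass it by value in two registers.
    struct span16 { void* data; std::size_t size; };

    // 32 bytes: too big for registers on either ABI, so a by-value copy goes through
    // memory anyway and const& costs nothing extra.
    struct blob32 { void* p[4]; };

    std::size_t use_span(span16 s)        { return s.size; }   // register-friendly by value
    void*       use_blob(blob32 const& b) { return b.p[0]; }   // prefer const& here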


    1 You might be saying, hey, rdi takes the first argument, not rsi - but what happens here is that the return value also has to be passed through memory, so a hidden first argument - a pointer to the destination buffer for the return value - is implicitly used and goes into rdi.