Search code examples
c++gccclangcalling-conventionstdtuple

Is returning a 2-tuple less efficient than std::pair?


Consider this code:

#include <utility>
#include <tuple>

std::pair<int, int> f1()
{
    return std::make_pair(0x111, 0x222);
}

std::tuple<int, int> f2()
{
    return std::make_tuple(0x111, 0x222);
}

Clang 3 and 4 generate similar code for both on x86-64:

f1():
 movabs rax,0x22200000111
 ret    
f2():
 movabs rax,0x11100000222 ; opposite packing order, not important
 ret    

But Clang 5 generates different code for f2():

f2():
 movabs rax,0x11100000222
 mov    QWORD PTR [rdi],rax
 mov    rax,rdi
 ret    

As do GCC 4 through GCC 7:

f2():
 movabs rdx,0x11100000222
 mov    rax,rdi
 mov    QWORD PTR [rdi],rdx ; GCC 4-6 use 2 DWORD stores
 ret

Why is the generated code worse when returning a std::tuple that fits in a single register, vs std::pair? It seems especially strange since Clang 3 and 4 seemed to be optimal, yet 5 is not.

Try it here: https://godbolt.org/g/T2Yqrj


Solution

  • The short answer is because the libstc++ standard library implementation used by gcc and clang on Linux implements std::tuple with a non-trivial move constructor (in particular, the _Tuple_impl base class has a non-trivial move constructor). On the other hand, the copy and move constructors for std::pair are all defaulted.

    This in turn causes a C++-ABI related difference in the calling convention for returning these objects from functions, as well as passing them by value.

    The Gory Details

    You ran your tests on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing or returning classes or structures to functions, which you can read more about here. The specific case we are interested in with whether the two int fields in these structures will get the INTEGER class or the MEMORY class.

    A recent version of the ABI specification has this to say:

    The classification of aggregate (structures and arrays) and union types works as follows:

    1. If the size of an object is larger than eight eightbytes, or it contains un- aligned fields, it has class MEMORY 12 .
    2. If a C++ object has either a non-trivial copy constructor or a non-trivial destructor 13 , it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER) 14 .
    3. If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class NO_CLASS.
    4. Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according to the classes of the fields in the eightbyte

    It is condition (2) that applies here. Note that it mentions only copy constructors, and not move constructors - but it is fairly apparently that just is probably just a defect in the specification given the introduction of move constructors which generally need to be included in any classification algorithm where copy constructors were included before. In particular, IA-64 cxx-abi, which gcc is documented to follow does include move constructors:

    If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference. Specifically:

    • Space is allocated by the caller in the usual manner for a temporary, typically on the stack.

    and then the definition of non-trivial:

    A type is considered non-trivial for the purposes of calls if:

    • it has a non-trivial copy constructor, move constructor, or destructor, or
    • all of its copy and move constructors are deleted.

    So because tuple is not considered to be trivially copyable from an ABI perspective, it gets MEMORY treatment, which means that your function must populate the stack allocated object passed in by the called in rdi. The std::pair function can just pass back the entire structure in rax since it fits in one EIGHTBYTE and has class INTEGER.

    Does it matter? Yeah, strictly speaking, a standalone function like the one you have compiled will be less efficient for tuple since this ABI different is "baked in".

    Often however, the compiler will be able to see the body of the function and inline it or perform inter-procedural analysis even if not inlined. In both cases, the ABI is no longer important and it is likely both approaches would be equally efficient, at least with a decent optimizer. For example let's call your f1() and f2() functions and do some math on the result:

    int add_pair() {
      auto p = f1();
      return p.first + p.second;
    }
    
    int add_tuple() {
      auto t = f2();
      return std::get<0>(t) + std::get<1>(t);
    }
    

    In principle the add_tuple method starts from a disadvantage, since it has to call f2() which is less efficient and it also has to create a temporary tuple object on the stack so it can pass it to f2 as the hidden parameter. Well no matter, both functions are fully optimized to just return the right value directly:

    add_pair():
      mov eax, 819
      ret
    add_tuple():
      mov eax, 819
      ret
    

    So overall you can say that the effect of this ABI issue with tuple will be relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this will only really matter in a relative sense for very small functions - but such functions are likely to be declared in a place where they can be inlined (or if not, you are leaving performance on the table).

    libcstc++ vs libc+++

    As explained above, this is an ABI issue, not an optimization issue, per se. Both clang and gcc are already optimizing the library code to maximum extent possible under the constraints of the ABI - if they generated code like f1() for the std::tuple case they would break ABI compliant callers.

    You can see this clearly if you switch to using libc++ rather than the Linux default of libstdc++ - this implementation doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, they are stuck with this implementation for backwards compatibility). Now clang (and presumably gcc although I didn't try it), generates the same optimal code in both cases:

    f1():                                 # @f1()
            movabs  rax, 2345052143889
            ret
    f2():                                 # @f2()
            movabs  rax, 2345052143889
            ret
    

    Earlier Versions of Clang

    Why do versions of clang compile it differently? It was simply a bug in clang or a bug in the spec depending on how you look at it. The spec didn't explicitly include move construction in the cases where a hidden pointer to a temporary needed to be passed. wasn't conforming to the IA-64 C++ ABI. For example compiled the way clang used to do it was not compatible with gcc or newer versions of clang. The spec was eventually updated and the clang behavior changed in version 5.0.

    Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and clang changed their behavior at some point, which probably explains the switch:

    The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases.