Why is there a locked xadd instruction in this disassambled std::string dtor?

I have a very simple code:

#include <string>
#include <iostream>

int main() {
    std::string s("abc");
    std::cout << s;
}

Then, I compiled it:

g++ -Wall test_string.cpp -o test_string -std=c++17 -O3 -g3 -ggdb3

And then decompiled it, and the most interesting piece is:

00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:
  4009a0:       48 81 ff a0 11 60 00    cmp    rdi,0x6011a0
  4009a7:       75 01                   jne    4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
  4009a9:       c3                      ret    
  4009aa:       b8 00 00 00 00          mov    eax,0x0
  4009af:       48 85 c0                test   rax,rax
  4009b2:       74 11                   je     4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>
  4009b4:       83 c8 ff                or     eax,0xffffffff
  4009b7:       f0 0f c1 47 10          lock xadd DWORD PTR [rdi+0x10],eax
  4009bc:       85 c0                   test   eax,eax
  4009be:       7f e9                   jg     4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
  4009c0:       e9 cb fd ff ff          jmp    400790 <_ZdlPv@plt>
  4009c5:       8b 47 10                mov    eax,DWORD PTR [rdi+0x10]
  4009c8:       8d 50 ff                lea    edx,[rax-0x1]
  4009cb:       89 57 10                mov    DWORD PTR [rdi+0x10],edx
  4009ce:       eb ec                   jmp    4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>

Why _ZNSs4_Rep10_M_disposeERKSaIcE.isra.10 (which is std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_M_dispose(std::allocator<char> const&) [clone .isra.10]) is a lock prefixed xadd?

A follow-up question is how I can avoid it?

Solution

It looks like code associated with copy on write strings. The locked instruction is decrementing a reference count and then calling operator delete only if the reference count for the possibly shared buffer containing the actual string data is zero (i.e., it is not shared: no other string object refers to it).

Since libstdc++ is open source, we can confirm this by looking at the source!

The function you've disassembled, _ZNSs4_Rep10_M_disposeERKSaIcE de-mangles¹ to std::basic_string<char>::_Rep::_M_dispose(std::allocator<char> const&). Here's the corresponding source for libstdc++ in the gcc-4.x era²:

    void
    _M_dispose(const _Alloc& __a)
    {
#if _GLIBCXX_FULLY_DYNAMIC_STRING == 0
      if (__builtin_expect(this != &_S_empty_rep(), false))
#endif
        {
          // Be race-detector-friendly.  For more info see bits/c++config.
          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&this->_M_refcount);
          if (__gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount,
                             -1) <= 0)
        {
          _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&this->_M_refcount);
          _M_destroy(__a);
        }
        }
    }  // XXX MT

Given that, we can annotate the assembly you provided, mapping each instruction back to the C++ source:

00000000004009a0 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10>:

  # the next two lines implement the check:
  # if (__builtin_expect(this != &_S_empty_rep(), false))
  # which is an empty string optimization. The S_empty_rep singleton
  # is at address 0x6011a0 and if the current buffer points to that
  # we are done (execute the ret)
  4009a0: cmp    rdi,0x6011a0
  4009a7: jne    4009aa <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0xa>
  4009a9: ret

  # now we are in the implementation of
  # __gnu_cxx::__exchange_and_add_dispatch(&this->_M_refcount, -1)
  # which dispatches either to an atomic version of the add function
  # or the non-atomic version, depending on the value of `eax` which
  # is always directly set to zero, so the non-atomic version is 
  # *always called* (see details below)
  4009aa: mov    eax,0x0
  4009af: test   rax,rax
  4009b2: je     4009c5 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x25>

  # this is the atomic version of the decrement you were concerned about
  # but we never execute this code because the test above always jumps
  # to 4009c5 (the non-atomic version)
  4009b4: or     eax,0xffffffff
  4009b7: lock xadd DWORD PTR [rdi+0x10],eax
  4009bc: test   eax,eax
  # check if the result of the xadd was zero, if not skip the delete
  4009be: jg     4009a9 <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x9>
  # the delete call
  4009c0: jmp    400790 <_ZdlPv@plt> # tailcall

  # the non-atomic version starts here, this is the code that is 
  # always executed
  4009c5: mov    eax,DWORD PTR [rdi+0x10]
  4009c8: lea    edx,[rax-0x1]
  4009cb: mov    DWORD PTR [rdi+0x10],edx
  # this jumps up to the test eax,eax check which calls operator delete
  # if the refcount was zero
  4009ce: jmp    4009bc <_ZNSs4_Rep10_M_disposeERKSaIcE.isra.10+0x1c>

A key note is that the lock xadd code you were concerned about is never executed. There is a mov eax, 0 followed by a test rax, rax; je - this test always succeeds and the jump always occurs because rax is always zero.

What's happening here is that __gnu_cxx::__atomic_add_dispatch is implemented in a way that it checks whether the process is definitely single threaded. If it is definitely single threaded, it doesn't bother to use expensive atomic instructions for things like __atomic_add_dispatch - it simply uses a regular non-atomic addition. It does this by checking the address of a pthreads function, __pthread_key_create - if this is zero, the pthread library hasn't been linked in, and hence the process is definitely single threaded. In your case, the address of this pthread function gets resolved at link time to 0 (you didn't have -lpthread on your compile command line), which is where the mov eax, 0x0 comes from. At link time, it's too late to optimize on this knowledge, so the vestigial atomic increment code remains but never executes. This mechanism is described in more detail in this answer.

The code that does execute is the last part of the function, starting at 4009c5. This code also decrements the reference count, but in a non-atomic way. The check which decides between these two options is probably based on whether the process is multithreaded or not, e.g., whether -lpthread has been linked. For whatever reason this check, inside __exchange_and_add_dispatch, is implemented in a way that prevents the compiler from actually removing the atomic half of the branch, even though the fact that it will never be taken is known at some point during the build process (after all, the hard-coded mov eax, 0 got there somehow).

A follow-up question is how I can avoid it?

Well you've already avoided the lock add part, so if that's what you care about, your good to go. However, you still have a cause for concern:

Copy on write std::string implementations are not standards compliant due to changes made in C++11, so the question remains why exactly you are getting this COW string behavior even when specifying -std=c++17.

The problem is most likely distribution related: CentOS 7 by default uses an ancient gcc version < 5 which still uses the non-compliant COW strings. However, you mention that you are using gcc 8.2.1, which by default in a normal install which uses non-COW strings. It seems like if you installed 8.2.1 use the RHEL "devtools" method, you'll get a new gcc which still uses the old ABI and links against the old system libstdc++.

To confirm this, you might want to check the value of _GLIBCXX_USE_CXX11_ABI macro in your test program, and also your libstdc++ version (the version information here might prove useful).

You can avoid by using an OS other than CentOS that doesn't use ancient gcc and glibc version. If you need to stick with CentOS for some reason you'll have to look into if there is a supported way to use newer libstdc++ version on that distribution. You could also consider using a containerization technology to build an executable independent of the library versions of your local host.

¹ You can demangle it like so: echo '_ZNSs4_Rep10_M_disposeERKSaIcE' | c++filt.

² I'm using gcc-4 era source since I'm guessing that's what you end up using in CentOS 7.