c++multithreading boost reference-counting thread-sanitizer

ThreadSanitizer reports "data race on operator delete(void*)" when using embedded reference counter

Please have a look at the following code:

#include <pthread.h>
#include <boost/atomic.hpp>

class ReferenceCounted {
  public:
    ReferenceCounted() : ref_count_(1) {}

    void reserve() {
      ref_count_.fetch_add(1, boost::memory_order_relaxed);
    }

    void release() {
      if (ref_count_.fetch_sub(1, boost::memory_order_release) == 1) {
        boost::atomic_thread_fence(boost::memory_order_acquire);
        delete this;
      }
    }

  private:
    boost::atomic<int> ref_count_;
};

void* Thread1(void* x) {
  static_cast<ReferenceCounted*>(x)->release();
  return NULL;
}

void* Thread2(void* x) {
  static_cast<ReferenceCounted*>(x)->release();
  return NULL;
}

int main() {
  ReferenceCounted* obj = new ReferenceCounted();
  obj->reserve(); // for Thread1
  obj->reserve(); // for Thread2
  obj->release(); // for the main()
  pthread_t t[2];
  pthread_create(&t[0], NULL, Thread1, obj);
  pthread_create(&t[1], NULL, Thread2, obj);
  pthread_join(t[0], NULL);
  pthread_join(t[1], NULL);
}

This is somewhat similar to the Reference counting example from Boost.Atomic.

The main differences are that the embedded ref_count_ is initialized to 1 in the constructor (once the constructor is completed we have a single reference to the ReferenceCounted object) and that the code doesn't use boost::intrusive_ptr.

Please don't blame me for using delete this in the code - this is the pattern that I have in a large code base at work and there's nothing I can do about it right now.

Now this code compiled with clang 3.5 from trunk (details below) and ThreadSanitizer (tsan v2) results in the following output from ThreadSanitizer:

WARNING: ThreadSanitizer: data race (pid=9871)
  Write of size 1 at 0x7d040000f7f0 by thread T2:
    #0 operator delete(void*) <null>:0 (a.out+0x00000004738b)
    #1 ReferenceCounted::release() /home/A.Romanek/tmp/tsan/main.cpp:15 (a.out+0x0000000a2c06)
    #2 Thread2(void*) /home/A.Romanek/tmp/tsan/main.cpp:29 (a.out+0x0000000a2833)

  Previous atomic write of size 4 at 0x7d040000f7f0 by thread T1:
    #0 __tsan_atomic32_fetch_sub <null>:0 (a.out+0x0000000896b6)
    #1 boost::atomics::detail::base_atomic<int, int, 4u, true>::fetch_sub(int, boost::memory_order) volatile /home/A.Romanek/tmp/boost/boost_1_55_0/boost/atomic/detail/gcc-atomic.hpp:499 (a.out+0x0000000a3329)
    #2 ReferenceCounted::release() /home/A.Romanek/tmp/tsan/main.cpp:13 (a.out+0x0000000a2a71)
    #3 Thread1(void*) /home/A.Romanek/tmp/tsan/main.cpp:24 (a.out+0x0000000a27d3)

  Location is heap block of size 4 at 0x7d040000f7f0 allocated by main thread:
    #0 operator new(unsigned long) <null>:0 (a.out+0x000000046e1d)
    #1 main /home/A.Romanek/tmp/tsan/main.cpp:34 (a.out+0x0000000a286f)

  Thread T2 (tid=9874, running) created by main thread at:
    #0 pthread_create <null>:0 (a.out+0x00000004a2d1)
    #1 main /home/A.Romanek/tmp/tsan/main.cpp:40 (a.out+0x0000000a294e)

  Thread T1 (tid=9873, finished) created by main thread at:
    #0 pthread_create <null>:0 (a.out+0x00000004a2d1)
    #1 main /home/A.Romanek/tmp/tsan/main.cpp:39 (a.out+0x0000000a2912)

SUMMARY: ThreadSanitizer: data race ??:0 operator delete(void*)
==================
ThreadSanitizer: reported 1 warnings

The strange thing is that thread T1 does a write of size 1 to the same memory location as thread T2 when doing atomic decrement on the reference counter.

How can the former write be explained? Is it some clean-up performed by the destructor of ReferenceCounted class?

It is a false positive? Or is the code wrong?

My setup is:

$ uname -a
Linux aromanek-laptop 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ clang --version
Ubuntu clang version 3.5-1ubuntu1 (trunk) (based on LLVM 3.5)
Target: x86_64-pc-linux-gnu
Thread model: posix

The code is compiled like this:

clang++ main.cpp -I/home/A.Romanek/tmp/boost/boost_1_55_0 -pthread -fsanitize=thread -O0 -g -ggdb3 -fPIE -pie -fPIC

Note that on my machine the implementation of boost::atomic<T> resolves to __atomic_load_n family of functions, which ThreadSanitizer claims to understand.

UPDATE 1: the same happens when using clang 3.4 final release.

UPDATE 2: the same problem occurs with -std=c++11 and <atomic> with both libstdc++ and libc++.

Solution

This looks like a false positive.

The thread_fence in the release() method enforces all outstanding writes from fetch_sub-calls to happen-before the fence returns. Therefore, the delete on the next line cannot race with the previous writes from decreasing the refcount.

Quoting from the book C++ Concurrency in Action:

A release operation synchronizes-with a fence with an order of std::memory_order_acquire [...] if that release operation stores a value that's read by an atomic operation prior to the fence on the same thread as the fence.

Since decreasing the refcount is a read-modify-write operation, this should apply here.

To elaborate, the order of operations that we need to ensure is as follows:

Decreasing the refcount to a value > 1
Decreasing the refcount to 1
Deleting the object

2. and 3. are synchronized implicitly, as they happen on the same thread. 1. and 2. are synchronized since they are both atomic read-modify-write operations on the same value. If these two could race the whole refcounting would be broken in the first place. So what is left is synchronizing 1. and 3..

This is exactly what the fence does. The write from 1. is a release operation that is, as we just discussed, synchronized with 2., a read on the same value. 3., an acquire fence on the same thread as 2., now synchronizes with the write from 1. as guaranteed by the spec. This happens without requiring an addition acquire write on the object (as was suggested by @KerrekSB in the comments), which would also work, but might be potentially less efficient due to the additional write.

Bottom line: Don't play around with memory orderings. Even experts get them wrong and their impact on performance is often negligible. So unless you have proven in a profiling run that they kill your performance and you absolutely positively have to optimize this, just pretend they don't exist and stick with the default memory_order_seq_cst.