Tags: c++, multithreading, performance, stdatomic, stdmutex

Relative performance of std::atomic and std::mutex


I am considering options for implementing a queue for a project, a requirement of which is that the producer, at least, must be as low-latency as possible. To this end, I have been investigating "lock free" queues that use std::atomic to control access to the data structure by the producer and consumer threads. My hope was that this would avoid the overhead of std::mutex, and specifically std::unique_lock, which the code currently uses.

I have written a simple test program to assess the relative performance of std::mutex (coupled with std::unique_lock) and std::atomic. The program also checks that the atomic object is lock free, which it is.

#include <mutex>
#include <atomic>
#include <thread>
#include <chrono>
#include <iostream>

#define CYCLES 100000000

void testAtomic()
{
   bool var(true);
   std::atomic_bool _value(true);

   std::cout << "atomic bool is ";

   if(!_value.is_lock_free())
      std::cout << "not ";
   std::cout << "lock free" << std::endl;
   const auto _start_time = std::chrono::high_resolution_clock::now();

   for(size_t counter = 0; counter < CYCLES; counter++)
   {
      var = _value.load();
      var = !var;
      _value.store(var);
   }

   const auto _end_time = std::chrono::high_resolution_clock::now();

   std::cout << 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>
      (_end_time - _start_time).count() << " s" << std::endl;
}

void testMutex()
{
   bool var(true);
   std::mutex _mutex;

   const auto _start_time = std::chrono::high_resolution_clock::now();

   for(size_t counter = 0; counter < CYCLES; counter++)
   {
      std::unique_lock<std::mutex> lock(_mutex);
      var = !var;
   }

   const auto _end_time = std::chrono::high_resolution_clock::now();

   std::cout << 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>
      (_end_time - _start_time).count() << " s" << std::endl;
}

int main()
{
   std::thread t1(testAtomic);
   t1.join();
   std::thread t2(testMutex);
   t2.join();
   return 0;
}

When running this program, I get the following output:

atomic bool is lock free
3.49434 s 
2.31755 s 

This suggests that std::mutex (with std::unique_lock) is significantly faster, which is the opposite of what I've come to expect from reading about atomics vs. mutexes. Are my findings correct? Is there a problem with my test program? Is my understanding of the differences between the two incorrect?

The code was compiled with GCC 4.8.5 on CentOS 7.


Solution

  • What hardware are you testing on?

    Since you're using GCC, the std::atomic seq_cst store will be using a mov + slow mfence instead of a somewhat less slow xchg-with-mem (which is also a full barrier, like all other x86 atomic RMW operations).
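
    For reference, here's a minimal pair (my illustration, not code from the question) that you can feed to a compiler explorer to see the difference. With GCC of that era targeting x86-64, the seq_cst store compiles to mov + mfence, while the release store is a plain mov (newer compilers may use xchg for the seq_cst case instead):

       #include <atomic>

       std::atomic<bool> flag;

       void store_seq_cst(bool v) {
          flag.store(v);                              // mov + mfence (GCC, x86-64)
       }

       void store_release(bool v) {
          flag.store(v, std::memory_order_release);   // plain mov, no barrier
       }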

    Taking a mutex costs an atomic RMW (like xchg, not mov + mfence), and if you're lucky, releasing the mutex can be just a plain store (like mo_release). There is zero contention, so acquiring the lock always succeeds.

    It's certainly plausible that the code behind those mutex lock / unlock library functions is less expensive than mfence, especially on Skylake CPUs with updated microcode where mfence is a full barrier for out-of-order execution as well as memory. (See the bottom of this answer, and also Does lock xchg have the same behavior as mfence?)
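
    To make "one atomic RMW to lock, a plain release store to unlock" concrete, here's a minimal spinlock sketch (my illustration of the uncontended fast path; a real pthread mutex does more bookkeeping, but the core costs are the same):

       #include <atomic>

       class SpinLock {
          std::atomic<bool> locked{false};
       public:
          void lock() {
             // exchange() is an atomic RMW (xchg on x86); with zero
             // contention it succeeds on the first try, so we never spin.
             while (locked.exchange(true, std::memory_order_acquire))
                ;  // spin until the current holder releases
          }
          void unlock() {
             // Releasing is just a plain store with release ordering
             // (a normal mov on x86, no barrier or RMW needed).
             locked.store(false, std::memory_order_release);
          }
       };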


    Also, note that your mutex loop optimized the local bool var into a register and isn't actually updating it in memory inside the loop. (Your code on the Godbolt compiler explorer with gcc4.8.5).

    # the main loop from testMutex
    .L80:                                                       # do {
            mov     rdi, rsp                                      # pointer to _mutex on the stack
            call    __gthrw_pthread_mutex_lock(pthread_mutex_t*)
            test    eax, eax
            jne     .L91                                          # mutex error handling
            mov     rdi, rsp                                      # pointer to _mutex again
            call    __gthrw_pthread_mutex_unlock(pthread_mutex_t*)
            sub     rbx, 1
            jne     .L80                                        # }while(--counter)
    

    An xor bl, 1 inside the loop would be irrelevant anyway; out-of-order exec could overlap it with other work.

    If a reference to var escaped the function, so the compiler had to keep it in sync in memory across the non-inline function calls (including to pthread library functions), we'd expect something like xor byte ptr [rsp+8], 1. That would also be pretty cheap, and perhaps mostly hidden by out-of-order exec, although the load/ALU/store could be something a full barrier would have to wait for when draining the store buffer.
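
    (If you did want var kept in sync in memory to see that codegen, one common benchmarking idiom is an empty inline-asm "escape" helper that makes the compiler assume the pointed-to object may be read or written externally. The helper name and approach here are my own illustration, GNU-style asm assumed:)

       // Force the compiler to treat *p as externally visible, so the
       // local bool really gets loaded/flipped/stored each iteration.
       static inline void escape(void* p) {
          asm volatile("" : : "g"(p) : "memory");
       }

       // inside the mutex loop:
       //    var = !var;
       //    escape(&var);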


    Speeding up your std::atomic code:

    You're intentionally avoiding an atomic RMW, it seems, instead loading into a tmp var and doing a separate store. If you use memory_order_release instead of the default seq_cst for the store, it can compile to just a plain store instruction on x86 (or to cheaper barriers on most other ISAs).

          bool tmp = _value.load(std::memory_order_relaxed);   // or acquire
          _value.store(!tmp, std::memory_order_release);
    

    This should run at about 6 cycles per inversion: just the latency of one ALU operation plus store-forwarding latency for the store/reload, versus maybe 33 cycles per iteration for the best-case throughput of mfence (https://uops.info/).
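
    As a rough sanity check (assuming, say, a 4 GHz core): 10^8 iterations at 6 cycles each is about 0.15 s, versus about 0.83 s at 33 cycles each, so even in the best case the mfence version would be several times slower.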

    Or, since this is a non-atomic modification, just store alternating values without re-reading the old value. You can usually only avoid an atomic RMW when a single producer writes the value and other threads only read it. So let the producer keep the value it's modifying in a register (a non-atomic local var) and store copies of it:

       bool var = true;
       for(size_t counter = 0; counter < CYCLES; counter++)
       {
          var = !var;
          _value.store(var, std::memory_order_release);
       }
    

    Also, don't use leading underscores for your own variable names; such names are reserved for the implementation. (A single underscore followed by a lowercase letter is only reserved at file / global scope, but it's still bad practice.)
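
    Putting these suggestions together, the atomic test could look something like this (a sketch, with the reserved-name underscores dropped):

       #include <atomic>
       #include <chrono>
       #include <iostream>

       constexpr size_t kCycles = 100000000;

       void testAtomicFast()
       {
          std::atomic<bool> value(true);
          bool var = true;   // producer keeps its working copy in a register

          const auto start_time = std::chrono::high_resolution_clock::now();

          for(size_t counter = 0; counter < kCycles; counter++)
          {
             var = !var;                                   // non-atomic flip
             value.store(var, std::memory_order_release);  // plain store on x86
          }

          const auto end_time = std::chrono::high_resolution_clock::now();

          std::cout << 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>
             (end_time - start_time).count() << " s" << std::endl;
       }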