c++multithreading c++11 atomic lock-free

Lock free single producer/single consumer circular buffer - Can CPU speculation break the memory barrier logic?

I have been looking at a lock free single producer/single consumer circular buffer when I thought about speculative execution and its effect on the simple code.

With this implementation, there is only a unique thread which can call the push() function and another unique thread which can call the pop() function.

Here is the Producer code:

bool push(const Element& item)
{       
  const auto current_tail = _tail.load(std::memory_order_relaxed);  //(1)
  const auto next_tail = increment(current_tail);

  if(next_tail != _head.load(std::memory_order_acquire))            //(2)               
  {     
    _array[current_tail] = item;                                    //(3)
    _tail.store(next_tail, std::memory_order_release);              //(4)
    return true;
  }
  return false; // full queue
}

Here is the Consumer code:

bool pop(Element& item)
{
  const auto current_head = _head.load(std::memory_order_relaxed);    //(1)
  if(current_head == _tail.load(std::memory_order_acquire))           //(2)
    return false; // empty queue

  item = _array[current_head];                                       //(3)
  _head.store(increment(current_head), std::memory_order_release);   //(4)
  return true;
}

The Question

What if the push() would be compiled as the following function due to speculative execution:

bool push(const Element& item)
{       
  const auto current_tail = _tail.load(std::memory_order_relaxed);  // 1
  const auto next_tail = increment(current_tail);

  //The load is performed before the test, it is valid
  const auto head = _head.load(std::memory_order_acquire);         

  //Here is the speculation, the CPU speculate that the test will succeed
  //store due to speculative execution AND it respects the memory order due to read-acquire
  _array[current_tail] = item;                             
  _tail.store(next_tail, std::memory_order_release); 

  //Note that in this case the test checks if you it has to restore the memory back
  if(next_tail == head)//the code was next_tail != _head.load(std::memory_order_acquire)    
  { 
   //We restore the memory back but the pop may have been called before and see an invalid memory
    _array[current_tail - 1] = item;                                 
    _tail.store(next_tail - 1, std::memory_order_release);             
    return true;
  }
  return false; // full queue
}

To me, to be perfectly valid the push function should make sure the barrier is issued after the condition success:

bool push(const Element& item)
{       
  const auto current_tail = _tail.load(std::memory_order_relaxed);  // 1
  const auto next_tail = increment(current_tail);                   
  if(next_tail != _head.load(std::memory_order_relaxed))            // 2               
  { 
    //Here we are sure that nothing can be reordered before the condition
    std::atomic_thread_fence(std::memory_order_acquire);            //2.1
    _array[current_tail] = item;                                    // 3
    _tail.store(next_tail, std::memory_order_release);              // 4
    return true;
  }
  return false; // full queue
}

Solution

re: your proposed reordering: no, the compiler can't invent writes to atomic variables.

Runtime speculation also can't invent writes that actually become visible to other threads. It can put whatever it wants in its own private store buffer, but the correctness of earlier branches must be checked before a store can become visible to other threads.

Normally this works by in-order retirement: an instruction can only retire (become non-speculative) once all previous instructions are retired/non-speculative. A store can't commit from the store buffer to L1d cache until after the store instruction retires.

re: the title: no, speculative execution still has to respect the memory model. If a CPU wants to speculatively load past an incomplete acquire-load, it can, but only if it checks to make sure those load results are still valid when they're "officially" allowed to happen.

x86 CPUs do in practice do this, because the strong x86 memory model means that all loads are acquire-loads, so any out-of-order loading has to be speculative and rolled back if it's not valid. (This is why you can get memory-order mis-speculation pipeline nukes.)

So asm works the way the ISA rules say it works, and C++ compilers know that. Compilers use this to implement the C++ memory model on top of the target ISA.

If you do an acquire-load in C++, it really works as an acquire-load.

You can mentally model your logic for possible compile-time + run-time reordering according to the C++ reordering rules as written. See http://preshing.com/20120913/acquire-and-release-semantics/.