multithreading, c++11, atomic, lockless

lock-free bounded MPMC ringbuffer failure


I've been banging my head against my attempt at a lock-free multiple-producer multiple-consumer ring buffer. The basis of the idea is to use the innate overflow of the unsigned char and unsigned short types, fix the element buffer to the full range of one of those types, and then you get a free loop back to the beginning of the ring buffer.

The problem is that my solution doesn't work for multiple producers (it does, though, work for N consumers, and also for a single producer with a single consumer).

#include <atomic>

template<typename Element, typename Index = unsigned char> struct RingBuffer
{
  std::atomic<Index> readIndex;
  std::atomic<Index> writeIndex;
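  // scratchIndex is meant to act as a pseudo-lock: producers race to CAS it
  // forward, and only the winner writes into elements[] before publishing
  // writeIndex.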
  std::atomic<Index> scratchIndex;
  Element elements[1 << (sizeof(Index) * 8)];

  RingBuffer() :
    readIndex(0),
    writeIndex(0),
    scratchIndex(0)
  {
    ;
  }

  bool push(const Element & element)
  {
    while(true)
    {
      const Index currentReadIndex = readIndex.load();
      Index currentWriteIndex = writeIndex.load();
      const Index nextWriteIndex = currentWriteIndex + 1;
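      // Treat the buffer as full when the next write slot would land on the read index.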
      if(nextWriteIndex == currentReadIndex)
      {
        return false;
      }

      if(scratchIndex.compare_exchange_strong(
        currentWriteIndex, nextWriteIndex))
      {
        elements[currentWriteIndex] = element;
        writeIndex = nextWriteIndex;
        return true;
      }
    }
  }

  bool pop(Element & element)
  {
    Index currentReadIndex = readIndex.load();

    while(true)
    {
      const Index currentWriteIndex = writeIndex.load();
      const Index nextReadIndex = currentReadIndex + 1;
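      // The queue is empty when the read index has caught up with the write index.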
      if(currentReadIndex == currentWriteIndex)
      {
        return false;
      }

      element = elements[currentReadIndex];

      if(readIndex.compare_exchange_strong(
        currentReadIndex, nextReadIndex))
      {
        return true;
      }
    }
  }
};

The main idea for writing was to use a temporary index 'scratchIndex' that acts as a pseudo-lock, allowing only one producer at any one time to copy-construct into the elements buffer before updating writeIndex and letting any other producer make progress. Before I am called a heathen for implying my approach is 'lock-free': I realise that this approach isn't exactly lock-free, but in practice (if it worked!) it would be significantly faster than a normal mutex!

I am aware of a (more complex) MPMC ring buffer solution here: http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue, but I am really experimenting with my own idea so that I can compare it against that approach and find out where each excels (or indeed whether my approach just flat out fails!).
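For reference, my understanding of the approach at that link is that every slot carries its own atomic sequence counter instead of sharing a single scratch index; roughly something like the following (my own paraphrase, with illustrative names SeqRingBuffer, Cell and SizePow2, not the code from that page):

#include <atomic>
#include <cstddef>

template<typename Element, std::size_t SizePow2 = 256> struct SeqRingBuffer
{
  static_assert((SizePow2 & (SizePow2 - 1)) == 0, "SizePow2 must be a power of two");

  struct Cell
  {
    std::atomic<std::size_t> sequence; // which lap this cell is ready for
    Element element;
  };

  Cell cells[SizePow2];
  std::atomic<std::size_t> enqueuePos;
  std::atomic<std::size_t> dequeuePos;

  SeqRingBuffer() : enqueuePos(0), dequeuePos(0)
  {
    for(std::size_t i = 0; i < SizePow2; ++i)
      cells[i].sequence.store(i, std::memory_order_relaxed);
  }

  bool push(const Element & element)
  {
    std::size_t pos = enqueuePos.load(std::memory_order_relaxed);
    while(true)
    {
      Cell & cell = cells[pos & (SizePow2 - 1)];
      const std::size_t seq = cell.sequence.load(std::memory_order_acquire);
      const std::ptrdiff_t diff = std::ptrdiff_t(seq) - std::ptrdiff_t(pos);
      if(diff == 0)
      {
        // The cell is free for this lap; claim it by bumping enqueuePos.
        if(enqueuePos.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed))
        {
          cell.element = element;
          cell.sequence.store(pos + 1, std::memory_order_release); // publish to consumers
          return true;
        }
      }
      else if(diff < 0)
      {
        return false; // cell still holds last lap's data: the queue is full
      }
      else
      {
        pos = enqueuePos.load(std::memory_order_relaxed); // lost a race, reload
      }
    }
  }

  bool pop(Element & element)
  {
    std::size_t pos = dequeuePos.load(std::memory_order_relaxed);
    while(true)
    {
      Cell & cell = cells[pos & (SizePow2 - 1)];
      const std::size_t seq = cell.sequence.load(std::memory_order_acquire);
      const std::ptrdiff_t diff = std::ptrdiff_t(seq) - std::ptrdiff_t(pos + 1);
      if(diff == 0)
      {
        if(dequeuePos.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed))
        {
          element = cell.element;
          cell.sequence.store(pos + SizePow2, std::memory_order_release); // recycle for next lap
          return true;
        }
      }
      else if(diff < 0)
      {
        return false; // nothing published in this cell yet: the queue is empty
      }
      else
      {
        pos = dequeuePos.load(std::memory_order_relaxed);
      }
    }
  }
};

As far as I can tell, the key difference from my version is that deciding whether a slot is usable and claiming it both go through that slot's own sequence counter, rather than through a shared index.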

Things I have tried:

  • Using compare_exchange_weak
  • Using more precise std::memory_order values that match the behaviour I want
  • Adding cacheline pads between the various indices I have (see the sketch after this list)
  • Making elements an array of std::atomic<Element> instead of a plain Element array
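
For the cacheline padding attempt above, what I tried looks roughly like this (the 64-byte line size is an assumption about the target hardware, and the name is illustrative):

#include <atomic>

// Illustrative only: each index gets its own cache line so producers and
// consumers hammering different indices do not false-share.
template<typename Element, typename Index = unsigned char> struct PaddedRingBuffer
{
  alignas(64) std::atomic<Index> readIndex;
  alignas(64) std::atomic<Index> writeIndex;
  alignas(64) std::atomic<Index> scratchIndex;
  alignas(64) Element elements[1 << (sizeof(Index) * 8)];
};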

I am sure that this boils down to a fundamental segfault in my head as to how to use atomic accesses to get round using mutexes, and I would be entirely grateful to whoever can point out which neurons are drastically misfiring in my head! :)


Solution

  • This is a form of the A-B-A problem. A successful producer looks something like this:

    1. load currentReadIndex
    2. load currentWriteIndex
    3. cmpxchg store scratchIndex = nextWriteIndex
    4. store element
    5. store writeIndex = nextWriteIndex
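
    In terms of the push() in the question, those steps land here (the code is unchanged, only the step comments are added):

    bool push(const Element & element)
    {
      while(true)
      {
        const Index currentReadIndex = readIndex.load();     // step 1
        Index currentWriteIndex = writeIndex.load();          // step 2
        const Index nextWriteIndex = currentWriteIndex + 1;
        if(nextWriteIndex == currentReadIndex)                 // the full check discussed below
        {
          return false;
        }

        if(scratchIndex.compare_exchange_strong(
          currentWriteIndex, nextWriteIndex))                  // step 3
        {
          elements[currentWriteIndex] = element;               // step 4
          writeIndex = nextWriteIndex;                         // step 5
          return true;
        }
      }
    }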

    If a producer stalls for some reason between steps 2 and 3 for long enough, it is possible for the other producers to produce an entire queue's worth of data and wrap back around to the exact same index so that the compare-exchange in step 3 succeeds (because scratchIndex happens to be equal to currentWriteIndex again).

    By itself, that isn't a problem. The stalled producer is perfectly within its rights to increment scratchIndex to lock the queue—even if a magical ABA-detecting cmpxchg rejected the store, the producer would simply try again, reload exactly the same currentWriteIndex, and proceed normally.

    The actual problem is the nextWriteIndex == currentReadIndex check between steps 2 and 3. The queue is logically empty if currentReadIndex == currentWriteIndex, so this check exists to make sure that no producer gets so far ahead that it overwrites elements that no consumer has popped yet. It appears to be safe to do this check once at the top, because all the consumers should be "trapped" between the observed currentReadIndex and the observed currentWriteIndex.

    Except that another producer can come along and bump up the writeIndex, which frees the consumer from its trap. If a producer stalls between steps 2 and 3, when it wakes up the stored value of readIndex could be absolutely anything.

    Here's an example, starting with an empty queue, that shows the problem happening:

    1. Producer A runs steps 1 and 2. Both loaded indices are 0. The queue is empty.
    2. Producer B interrupts and produces an element.
    3. Consumer pops an element. Both indices are 1.
    4. Producer B produces 255 more elements. The write index wraps around to 0, the read index is still 1.
    5. Producer A awakens from its slumber. It had previously loaded both the read and write indices as 0 (empty queue!), so it attempts step 3. Because the other producer's write index coincidentally stopped back at 0, scratchIndex is also 0, the compare-exchange succeeds, and the element store goes ahead. On completion the producer sets writeIndex = 1, so both stored indices are 1 and the queue is logically empty. A full queue's worth of elements will now be completely ignored.

    (I should mention that the only reason I can get away with talking about "stalling" and "waking up" is that all the atomics used are sequentially consistent, so I can pretend that we're in a single-threaded environment.)


    Note that the way you are using scratchIndex to guard concurrent writes is essentially a lock; whoever successfully completes the cmpxchg gets exclusive write access to the queue until it releases the lock. The simplest way to fix this failure is to just replace scratchIndex with a spinlock: it won't suffer from A-B-A, and it is what is actually happening anyway.
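
    A minimal sketch of that fix, with illustrative names and only the producer side shown (pop() can stay as it is), might look like this:

    #include <atomic>

    template<typename Element, typename Index = unsigned char> struct RingBufferSpin
    {
      std::atomic<Index> readIndex;
      std::atomic<Index> writeIndex;
      std::atomic_flag writeLock;   // takes over the role of scratchIndex
      Element elements[1 << (sizeof(Index) * 8)];

      RingBufferSpin() :
        readIndex(0),
        writeIndex(0)
      {
        writeLock.clear();
      }

      bool push(const Element & element)
      {
        // Spin until this producer owns the write side.
        while(writeLock.test_and_set(std::memory_order_acquire))
        {
        }

        const Index currentReadIndex = readIndex.load();
        const Index currentWriteIndex = writeIndex.load();
        const Index nextWriteIndex = currentWriteIndex + 1;

        bool pushed = false;
        // This check cannot go stale: writeIndex only moves while the lock is
        // held, and readIndex can only move away from the full condition.
        if(nextWriteIndex != currentReadIndex)
        {
          elements[currentWriteIndex] = element;
          writeIndex.store(nextWriteIndex);
          pushed = true;
        }

        writeLock.clear(std::memory_order_release);
        return pushed;
      }
    };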