Tags: queue, cpu-architecture, cpu-cache, micro-optimization, lock-free

What is the purpose of the index caches in rigtorp's SPSCQueue?


I am reading the implementation of rigtorp's SPSCQueue, which has a very elegant design and very good benchmark results.

I understand most of the design philosophy as described in the README. What I cannot understand is how writeIdxCache_ and readIdxCache_ help.

Looking at the following function for fetching an item from the queue:

case 0: original implementation

  RIGTORP_NODISCARD T *front() noexcept {
    auto const readIdx = readIdx_.load(std::memory_order_relaxed);
    if (readIdx == writeIdxCache_) {
      writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
      if (writeIdxCache_ == readIdx) {
        return nullptr;
      }
    }
    return &slots_[readIdx + kPadding];
  }

If I compare the above implementation with the following ones, what is the benefit?

case 1: not using the cache

What is the benefit of using writeIdxCache_ in the first place?

  RIGTORP_NODISCARD T *front() noexcept {
    auto const readIdx = readIdx_.load(std::memory_order_relaxed);
    if (readIdx == writeIdx_.load(std::memory_order_acquire)) {
        return nullptr;
    }
    return &slots_[readIdx + kPadding];
  }

case 2: directly set readIdxCache_

Here I also set readIdxCache_ directly when loading it. Does this approach still work, or would it break the queue? Is it better or worse than case 0?

  RIGTORP_NODISCARD T *front() noexcept {
    readIdxCache_ = readIdx_.load(std::memory_order_relaxed);
    if (readIdxCache_ == writeIdxCache_) {
      writeIdxCache_ = writeIdx_.load(std::memory_order_acquire);
      if (writeIdxCache_ == readIdxCache_) {
        return nullptr;
      }
    }
    return &slots_[readIdxCache_ + kPadding];
  }

Solution

  • For a CPU to commit a store to a cache line, it needs exclusive ownership of that cache line. (MESI Modified or Exclusive state). If one thread is writing a shared variable and no other threads are reading it, things are efficient and the cache line containing it can stay in that state.

    But if another thread so much as reads it, its MESI Share Request will transition the cache line from Modified to Shared, so the writer will need a Read For Ownership before it can commit its next store. Also, the thread doing the reading will have to wait for that cache miss (share request) on its loads, if the writer has written since its last read. (A Read For Ownership invalidates all other copies of the cache line in other cores.)

    Without the index caches, every call to front() reads both shared variables, including the one the other thread sometimes writes. This creates contention on writeIdx_ as cache-coherency messages for it have to get sent for most reads and most writes. (And similarly for the writer creating contention on readIdx_ by reading it every time.)

    With the index caches, front() only touches writeIdx_ occasionally, only after all the queue entries it saw last time have been consumed. The writer similarly avoids touching readIdx_ often, so if you're steadily removing entries from the queue, the consumer's writes to readIdx_ stay efficient.