Let's imagine a lock-free concurrent SPSC (single-producer / single-consumer) queue.

The producer thread reads `head`, `tail` and `cached_tail`, and writes `head` and `cached_tail`. The consumer thread reads `head`, `tail` and `cached_head`, and writes `tail` and `cached_head`.

Note that `cached_tail` is accessed only by the producer thread, just like `cached_head` is accessed only by the consumer thread. They can be thought of as private thread-local variables, so they are unsynchronized and thus not declared atomic.
The data layout of the queue is the following:
#include <atomic>
#include <cstddef>
#include <new> // std::hardware_destructive_interference_size
#include <thread>
struct spsc_queue
{
/// ...
// Producer variables
alignas(std::hardware_destructive_interference_size) std::atomic<size_t> head; // shared
size_t cached_tail; // non-shared
// Consumer variables
alignas(std::hardware_destructive_interference_size) std::atomic<size_t> tail; // shared
size_t cached_head; // non-shared
std::byte padding[std::hardware_destructive_interference_size - sizeof(tail) - sizeof(cached_head)];
};
Since I want to avoid false sharing, I aligned `head` and `tail` to the L1 cache line size.
The pseudo-code-ish implementation of the `push`/`pop` operations is the following:
bool push(const void* elems, size_t n)
{
size_t h = atomic_load(head, relaxed);
if (num_remaining_storage(h, cached_tail) < n)
{
cached_tail = atomic_load(tail, acquire);
if (num_remaining_storage(h, cached_tail) < n)
return false;
}
// write from elems
atomic_store(head, h + n, release);
return true;
}
bool pop(void* elems, size_t n)
{
size_t t = atomic_load(tail, relaxed);
if (num_stored_elements(cached_head, t) < n)
{
cached_head = atomic_load(head, acquire);
if (num_stored_elements(cached_head, t) < n)
return false;
}
    // read into elems
atomic_store(tail, t + n, release);
return true;
}
void wait_and_push(const void* elems, size_t n)
{
size_t h = atomic_load(head, relaxed);
while (num_remaining_storage(h, cached_tail) < n)
cached_tail = atomic_load(tail, acquire);
// write from elems
atomic_store(head, h + n, release);
}
void wait_and_pop(void* elems, size_t n)
{
size_t t = atomic_load(tail, relaxed);
while (num_stored_elements(cached_head, t) < n)
cached_head = atomic_load(head, acquire);
    // read into elems
atomic_store(tail, t + n, release);
}
At initialization (not listed here), all the indices are set to `0`.
The functions `num_remaining_storage` and `num_stored_elements` are `const` functions performing simple calculations based on the passed arguments and the immutable queue capacity - they do not perform any atomic reads or writes.
Now the question is: do I need to align `cached_tail` and `cached_head` as well to completely avoid false sharing of any of the indices, or is it okay as it is? Since `cached_tail` is producer-private and `cached_head` is consumer-private, I think `cached_tail` can be in the same cache line as `head` (the producer cache line), just like `cached_head` can be in the same cache line as `tail` (the consumer cache line), without false sharing ever happening.

Am I missing something?
Thank you for providing the pseudocode - it is still lacking some details, but I think I get the basic idea. You have a bounded SPSC queue where the indices can wrap around, and you use the `cached_tail` variable in `push` to check whether there are free slots, so you can avoid loading `tail` from a potentially invalidated cache line (and vice versa for `pop`).
I would suggest putting `head` and `cached_tail` next to each other (i.e., on the same cache line), and `tail` and `cached_head` on a different one. `push` always reads both variables - `head` and `cached_tail` - so it makes sense to have them close together. `cached_tail` is only updated if there are no more free slots and we have to reload `tail`.
Your code is a bit thin on details, but it seems that there is some room for optimization:
bool push(const void* elems, size_t n)
{
    size_t h = atomic_load(head, relaxed);
    if (num_remaining_storage(h, cached_tail) < n)
    {
        auto t = atomic_load(tail, acquire);
        if (t == cached_tail)
            return false;
        // The reloaded value differs, so the consumer has freed at
        // least one slot. For n == 1 that alone guarantees room; for a
        // batch push we still have to recheck against the fresh value.
        if (num_remaining_storage(h, t) < n)
            return false;
        // Update cached_tail only on the success path, so it changes
        // only together with head.
        cached_tail = t;
    }
    // write from elems
    atomic_store(head, h + n, release);
    return true;
}
That way `cached_tail` is only ever updated when `head` is also updated, so this is another reason for them to be on the same cache line. Of course, the same kind of optimization can be applied to `pop` as well.
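For completeness, here is a sketch of the `pop` counterpart in the same pseudo-code style. Note that for batch sizes `n > 1`, a changed `head` only guarantees that *some* elements arrived, so a recheck against the reloaded value is still needed:

```
bool pop(void* elems, size_t n)
{
    size_t t = atomic_load(tail, relaxed);
    if (num_stored_elements(cached_head, t) < n)
    {
        auto h = atomic_load(head, acquire);
        if (h == cached_head)
            return false;
        // recheck against the freshly loaded head: for a batch pop,
        // a changed head does not by itself guarantee n stored elements
        if (num_stored_elements(h, t) < n)
            return false;
        // update cached_head only on the success path
        cached_head = h;
    }
    // read into elems
    atomic_store(tail, t + n, release);
    return true;
}
```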
That is exactly why I wanted to see some code, because the access pattern is crucial to determine which variables should share a cache line and which should not.