Let's imagine a lock-free concurrent SPSC (single-producer / single-consumer) queue.

The producer thread reads `head`, `tail` and `cached_tail`, and writes `head` and `cached_tail`. The consumer thread reads `head`, `tail` and `cached_head`, and writes `tail` and `cached_head`.

Note that `cached_tail` is accessed only by the producer thread, just like `cached_head` is accessed only by the consumer thread. They can be thought of as private thread-local variables, so they are unsynchronized and thus not declared atomic.
The data layout of the queue is the following:
#include <atomic>
#include <cstddef>
#include <new> // std::hardware_destructive_interference_size
#include <thread>
struct spsc_queue
{
/// ...
// Producer variables
alignas(std::hardware_destructive_interference_size) std::atomic<size_t> head; // shared
size_t cached_tail; // non-shared
// Consumer variables
alignas(std::hardware_destructive_interference_size) std::atomic<size_t> tail; // shared
size_t cached_head; // non-shared
std::byte padding[std::hardware_destructive_interference_size - sizeof(tail) - sizeof(cached_head)];
};
Since I want to avoid false sharing, I aligned `head` and `tail` to the L1 cache line size.
The pseudo-code-ish implementation of the `push`/`pop` operations is the following:
bool push(const void* elems, size_t n)
{
size_t h = atomic_load(head, relaxed);
if (num_remaining_storage(h, cached_tail) < n)
{
cached_tail = atomic_load(tail, acquire);
if (num_remaining_storage(h, cached_tail) < n)
return false;
}
// write from elems
atomic_store(head, h + n, release);
return true;
}
bool pop(void* elems, size_t n)
{
size_t t = atomic_load(tail, relaxed);
if (num_stored_elements(cached_head, t) < n)
{
cached_head = atomic_load(head, acquire);
if (num_stored_elements(cached_head, t) < n)
return false;
}
    // read into elems
atomic_store(tail, t + n, release);
return true;
}
void wait_and_push(const void* elems, size_t n)
{
size_t h = atomic_load(head, relaxed);
while (num_remaining_storage(h, cached_tail) < n)
cached_tail = atomic_load(tail, acquire);
// write from elems
atomic_store(head, h + n, release);
}
void wait_and_pop(void* elems, size_t n)
{
size_t t = atomic_load(tail, relaxed);
while (num_stored_elements(cached_head, t) < n)
cached_head = atomic_load(head, acquire);
    // read into elems
atomic_store(tail, t + n, release);
}
At initialization (not listed here), all the indices are set to `0`.
The functions `num_remaining_storage` and `num_stored_elements` are `const` functions performing simple calculations based on the passed arguments and the immutable queue capacity - they do not perform any atomic reads or writes.
Now the question is: do I need to align `cached_tail` and `cached_head` as well to completely avoid false sharing of any of the indices, or is it okay as it is? Since `cached_tail` is producer-private and `cached_head` is consumer-private, I think `cached_tail` can be in the same cache line as `head` (the producer cache line), just like `cached_head` can be in the same cache line as `tail` (the consumer cache line), without false sharing ever happening.

Am I missing something?
Thank you for providing the pseudocode - it is still lacking some details, but I think I get the basic idea. You have a bounded SPSC queue where the indices can wrap around, and you use the `cached_tail` variable in `push` to check whether there are free slots, so you can avoid loading `tail` from a potentially invalidated cache line (and vice versa for `pop`).
I would suggest putting `head` and `cached_tail` next to each other (i.e., on the same cache line), and `tail` and `cached_head` on a different one. `push` always reads both variables - `head` and `cached_tail` - so it makes sense to have them close together. `cached_tail` is only updated if there are no more free slots and we have to reload `tail`.
Your code is a bit thin on details, but it seems that there is some room for optimization:
bool push(const void* elems, size_t n)
{
    size_t h = atomic_load(head, relaxed);
    if (num_remaining_storage(h, cached_tail) < n)
    {
        auto t = atomic_load(tail, acquire);
        if (t == cached_tail)
            return false;
        // The reloaded value differs, so the consumer has freed at
        // least one slot. For n == 1 that alone guarantees room; for a
        // batch push we still have to recheck against the fresh value.
        if (num_remaining_storage(h, t) < n)
            return false;
        // Update cached_tail only on the success path, so it changes
        // only together with head.
        cached_tail = t;
    }
    // write from elems
    atomic_store(head, h + n, release);
    return true;
}
That way `cached_tail` is only ever updated when `head` is also updated, so this is another reason for them to be on the same cache line. Of course, the same kind of optimization can be applied to `pop` as well.
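For completeness, here is a sketch of the `pop` counterpart in the same pseudo-code style. Note that for batch sizes `n > 1`, a changed `head` only guarantees that *some* elements arrived, so a recheck against the reloaded value is still needed:

```
bool pop(void* elems, size_t n)
{
    size_t t = atomic_load(tail, relaxed);
    if (num_stored_elements(cached_head, t) < n)
    {
        auto h = atomic_load(head, acquire);
        if (h == cached_head)
            return false;
        // recheck against the freshly loaded head: for a batch pop,
        // a changed head does not by itself guarantee n stored elements
        if (num_stored_elements(h, t) < n)
            return false;
        // update cached_head only on the success path
        cached_head = h;
    }
    // read into elems
    atomic_store(tail, t + n, release);
    return true;
}
```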
That is exactly why I wanted to see some code, because the access pattern is crucial to determine which variables should share a cache line and which should not.