Tags: c, x86, intel, memory-alignment, cpu-cache

Optimal way to pass a few variables between 2 threads pinned to different CPUs


I have a problem and I need to understand whether there is a better solution. I have written the following code to pass a few variables from a writer thread to a reader thread. These threads are pinned to different CPUs that share the same L2 cache (hyperthreading is disabled).

writer_thread.h

#include <stdint.h>

struct a_few_vars {
    uint32_t x1;
    uint32_t x2;

    uint64_t x3;
    uint64_t x4;
} __attribute__((aligned(64)));

volatile uint32_t head;
struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));

reader_thread.h

uint32_t tail;
struct a_few_vars *p_xxx;

The writer thread increases the head variable, and the reader thread checks whether head and tail are equal. If they are not equal, it reads the new data as follows:

while (true) {
    if (tail != head) {
        .. process xxx[head] ..
        .. update tail ..
    }
}

Performance is by far the most important issue. I'm using Intel Xeon processors, and the reader thread fetches the head value and the xxx[head] data from memory each time. I used the aligned array to be lock-free.

In my case, is there any method to flush the variables to the reader CPU's cache as soon as possible? Can I trigger a prefetch for the reader CPU from the writer CPU? I can use special Intel instructions via __asm__ if they exist. In conclusion, what is the fastest way to pass the variables in the struct between threads pinned to different CPUs?

Thanks in advance


Solution

  • It's undefined behaviour for one thread to write a volatile variable while another thread reads it, according to C11. volatile accesses are also not ordered with respect to other accesses. You want atomic_store_explicit(&head, new_value, memory_order_release) in the writer and atomic_load_explicit(&head, memory_order_acquire) in the reader to create acq/rel synchronization, and force the compiler to make the stores into your struct visible before the store to head which indicates to the reader that there's new data.

    (tail is private to the reader thread, so there's no mechanism for the writer to wait for the reader to have seen new data before writing more. So technically there's a possible race on the struct contents if the writer writes again while the reader is still reading, which means the struct should also be _Atomic.) A minimal sketch of the acquire/release pairing is shown below.
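
    As a rough illustration only (assuming a single writer and a single reader; the helper names publish and try_consume are made up for this example, and the slots are left non-atomic for brevity despite the caveat above):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct a_few_vars {
        uint32_t x1;
        uint32_t x2;
        uint64_t x3;
        uint64_t x4;
    } __attribute__((aligned(64)));

    static struct a_few_vars xxx[UINT16_MAX] __attribute__((aligned(64)));
    static _Atomic uint32_t head;          /* published index, written only by the writer */

    /* writer: fill the next slot with plain stores, then publish it with a
       release store so the reader's acquire load also sees the slot's data */
    void publish(const struct a_few_vars *v)
    {
        uint32_t h = atomic_load_explicit(&head, memory_order_relaxed);
        xxx[h % UINT16_MAX] = *v;
        atomic_store_explicit(&head, h + 1, memory_order_release);
    }

    /* reader: tail is private to the reader; returns false if nothing new */
    bool try_consume(uint32_t *tail, struct a_few_vars *out)
    {
        uint32_t h = atomic_load_explicit(&head, memory_order_acquire);
        if (*tail == h)
            return false;
        *out = xxx[*tail % UINT16_MAX];
        (*tail)++;
        return true;
    }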


    You might want a seq-lock, where the writer updates a sequence number and the reader checks it before and after copying out the variables (https://en.wikipedia.org/wiki/Seqlock). This lets you detect and retry in the rare cases where the writer was in the middle of an update when the reader copied the data.

    It's pretty good for write-only / read-only situations, especially if you don't need to worry about the reader missing an update.

    See my attempt at a SeqLock in C++11: Implementing 64 bit atomic counter with 32 bit atomics, and also how to implement a seqlock lock using c++11 atomic library.

    And GCC reordering up across load with `memory_order_seq_cst`. Is this allowed? shows another example (this one causes a gcc bug).

    Porting these from C++11 std::atomic to C11 stdatomic should be straightforward. Make sure to use atomic_store_explicit, because the default memory ordering for plain atomic_store is memory_order_seq_cst, which is slower. A C11 sketch of the pattern follows.
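
    A minimal C11 sketch of that seqlock pattern (single writer assumed; the names seqlock_write and seqlock_read are made up for this example, and the non-atomic copy of the payload is the usual practical shortcut, strictly the members should be relaxed atomics to avoid a formal data race):

    #include <stdatomic.h>
    #include <stdint.h>

    struct payload {
        uint32_t x1, x2;
        uint64_t x3, x4;
    };

    static _Atomic uint32_t seq;       /* even = stable, odd = writer busy */
    static struct payload   shared;    /* written only by the writer thread */

    /* single writer: make seq odd, store the data, make seq even again */
    void seqlock_write(const struct payload *v)
    {
        uint32_t s = atomic_load_explicit(&seq, memory_order_relaxed);
        atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);   /* keep the data stores after the odd store */
        shared = *v;
        atomic_store_explicit(&seq, s + 2, memory_order_release);
    }

    /* reader: copy the data, then retry if the writer was active meanwhile */
    struct payload seqlock_read(void)
    {
        struct payload copy;
        uint32_t s1, s2;
        do {
            s1 = atomic_load_explicit(&seq, memory_order_acquire);
            copy = shared;
            atomic_thread_fence(memory_order_acquire); /* finish the copy before re-reading seq */
            s2 = atomic_load_explicit(&seq, memory_order_relaxed);
        } while ((s1 & 1) || s1 != s2);
        return copy;
    }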


    Not much you can do will actually speed up the writer making its stores globally visible. A CPU core already commits stores from its store buffer to its L1d as quickly as possible (obeying the restrictions of the x86 memory model, which doesn't allow StoreStore reordering).

    On a Xeon, see When CPU flush value in storebuffer to L1 Cache? for some info about different Snoop Modes and their effect on inter-core latency within a single socket.

    The caches on multiple cores are coherent, using MESI to maintain coherency.

    A reader spin-waiting on an atomic variable is probably the best you can do, using _mm_pause() inside the spin loop to avoid a memory-order mis-speculation pipeline clear when exiting the spin-loop.

    You also don't want to wake up in the middle of a write and have to retry. You might want to put the seq-lock counter in the same cache line as the data, so those stores can hopefully be merged in the store buffer of the writing core; a sketch combining this layout with the spin loop is shown below.
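
    Putting those last two points together, a rough sketch of the layout and the reader's spin loop (assuming x86 with GCC or Clang; _mm_pause() comes from immintrin.h, and wait_for_update is a made-up name):

    #include <stdatomic.h>
    #include <stdint.h>
    #include <immintrin.h>                 /* _mm_pause() */

    /* sequence counter and payload share one 64-byte cache line, so the
       writer's stores to both can merge in its store buffer */
    struct __attribute__((aligned(64))) shared_line {
        _Atomic uint32_t seq;              /* even = stable, odd = writer busy */
        uint32_t x1;
        uint32_t x2;
        uint64_t x3;
        uint64_t x4;
    };

    static struct shared_line line;

    /* reader: spin until a new, even sequence number is published */
    uint32_t wait_for_update(uint32_t last_seen)
    {
        uint32_t s;
        while ((s = atomic_load_explicit(&line.seq, memory_order_acquire)) == last_seen
               || (s & 1)) {
            _mm_pause();                   /* back off between polls; avoids the
                                              memory-order mis-speculation flush on exit */
        }
        return s;
    }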