Even for a simple 2-thread communication example, I have difficulty expressing this in the C11 atomic and memory-fence style to obtain proper memory ordering:
shared data:

```c
#include <stdatomic.h>

atomic_int flag;        /* must be an atomic type for atomic_*_explicit */
volatile int bucket;
```

producer thread:

```c
while (true) {
    int value = producer_work();
    while (atomic_load_explicit(&flag, memory_order_acquire))
        ; /* busy wait until the consumer has emptied the bucket */
    bucket = value;
    atomic_store_explicit(&flag, 1, memory_order_release);
}
```

consumer thread:

```c
while (true) {
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ; /* busy wait until the producer has filled the bucket */
    int data = bucket;
    atomic_thread_fence(/* memory_order ??? */);
    atomic_store_explicit(&flag, 0, memory_order_release);
    consumer_work(data);
}
```
As far as I understand, the code above would properly order the store-in-bucket -> flag-store -> flag-load -> load-from-bucket. However, I think there remains a race condition between the load-from-bucket and re-writing the bucket with new data. To force an order following the bucket read, I guess I would need an explicit `atomic_thread_fence()` between the bucket read and the following atomic store. Unfortunately, there seems to be no `memory_order` argument that enforces anything on preceding loads, not even `memory_order_seq_cst`.
A really dirty solution could be to re-assign `bucket` in the consumer thread with a dummy value, but that contradicts the consumer's read-only role.
In the older C99/GCC world I could use the traditional `__sync_synchronize()`, which I believe would be strong enough.
What would be a nicer C11-style solution for synchronizing this so-called anti-dependency?

(Of course I am aware that I would be better off avoiding such low-level coding and using the available higher-level constructs, but I would like to understand...)
> To force an order following the bucket read, I guess I would need an explicit `atomic_thread_fence()` between the bucket read and the following atomic store.
I do not believe the `atomic_thread_fence()` call is necessary: the flag update has release semantics, preventing any preceding load or store operations from being reordered across it. See the formal definition by Herb Sutter:

> A write-release executes after all reads and writes by the same thread that precede it in program order.

This should prevent the read of `bucket` from being reordered to occur after the `flag` update, regardless of where the compiler chooses to store `data`.
That brings me to your comment about another answer:
> The `volatile` ensures that there are ld/st operations generated, which can subsequently be ordered with fences. However, `data` is a local variable, not volatile. The compiler will probably put it in a register, avoiding a store operation. That leaves the load from `bucket` to be ordered with the subsequent reset of `flag`.
It would seem that is not an issue if the `bucket` read cannot be reordered past the `flag` write-release, so `volatile` should not be necessary (though it probably doesn't hurt to have it, either). It's also unnecessary because most function calls (in this case, `atomic_store_explicit(&flag)`) serve as compile-time memory barriers: the compiler will not reorder the read of a global variable past a non-inlined function call, because that function could modify the same variable.
I would also agree with @MaximYegorushkin that you could improve your busy-waiting with `pause` instructions when targeting compatible architectures. GCC and ICC both appear to have a `_mm_pause(void)` intrinsic (probably equivalent to `__asm__ ("pause;")`).