I am using the C volatile keyword in combination with x86 memory ordering guarantees (writes are ordered with writes, and reads are ordered with reads) to implement a barrier-free message queue. Does gcc provide a builtin function that efficiently copies data from one volatile array to another? I.e., is there a builtin/efficient function that we could call as memcpy_volatile is used in the following example?
uint8_t volatile * dest = ...;
uint8_t volatile const* src = ...;
int len;
memcpy_volatile(dest, src, len);
rather than writing a naive loop?
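For reference, the naive loop the question wants to avoid might look like this (a sketch; memcpy_volatile is the hypothetical name from the question, not an existing gcc builtin):

```c
#include <stddef.h>
#include <stdint.h>

/* Naive byte-by-byte copy through volatile-qualified pointers.
 * The volatile qualification forces one load and one store per byte,
 * preventing the compiler from using wide accesses or calling an
 * optimized memcpy -- hence the interest in a faster alternative. */
static void memcpy_volatile(volatile uint8_t *dest,
                            const volatile uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dest[i] = src[i];
}
```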
This question is NOT about the popularity of barrier-free C programs. I am perfectly aware of barrier-based alternatives. This question is, therefore, also NOT a duplicate of any question where the answer is "use barrier primitives".
This question is also NOT a duplicate of similar questions that are not specific to x86/gcc, where of course the answer is "there's no general mechanism that works on all platforms".
Additional Detail
memcpy_volatile is not expected to be atomic. The ordering of operations within memcpy_volatile does not matter. What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable), then the sequence (data write, pointer write) must appear in the same order to the other thread. So if the other thread sees the new pointer (dest), then it must also see the data that was copied to *dest. This is the essential requirement for a barrier-free queue implementation.
memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. ...
Ok, that makes the problem solvable: you're just "publishing" the memcpy stores via release/acquire synchronization.

The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store, because volatile operations are only guaranteed ordered (at compile time) wrt. other volatile operations. Since the buffer isn't being concurrently accessed while you're storing into it, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.
To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the memcpy and the release-store.
volatile uint8_t *shared_var;

memcpy((char*)dest, (const char*)src, len); // plain stores; casts drop volatile
asm("" ::: "memory");  // compiler barrier: keep the memcpy before the store
shared_var = dest;     // release-store
But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst will let it compile with no overhead for x86, to the same asm you get from volatile.
The compiler knows x86's memory ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)
Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; volatile x++ is more like atomic_store_explicit(&x, 1 + atomic_load_explicit(&x, memory_order_acquire), memory_order_release); which is a big pain to type, but often you'd want to load into a tmp variable anyway.
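The difference can be spelled out as two functions (a sketch; the names are illustrative):

```c
#include <stdatomic.h>

atomic_int x;  /* zero-initialized at file scope */

void bump_volatile_style(void)
{
    /* What volatile x++ amounts to: a separate load and store.
     * Each access is individually atomic, but the pair is NOT an
     * atomic RMW -- another thread's increment can be lost in between. */
    int tmp = atomic_load_explicit(&x, memory_order_acquire);
    atomic_store_explicit(&x, tmp + 1, memory_order_release);
}

void bump_rmw(void)
{
    /* A true atomic RMW; on x86 this compiles to lock add,
     * which is a full barrier regardless of the memory_order. */
    atomic_fetch_add_explicit(&x, 1, memory_order_relaxed);
}
```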
If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.
volatile uint8_t *shared_var;

memcpy((char*)dest, (const char*)src, len);
// a release-store is ordered after all previous stuff in this thread
__atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);
As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64, where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)
The key point is that there's no down-side to the generated asm for x86.
As in When to use volatile with multi threading? - never. Use atomic with memory_order_relaxed, or with acquire / release to get C-level guarantees equivalent to x86 hardware memory-ordering.