
Why does QEMU use __atomic_thread_fence() together with barrier()?


QEMU atomic.h has these definitions:

#define smp_mb()                     ({ barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); })
#define smp_mb_release()             ({ barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); })
#define smp_mb_acquire()             ({ barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); })

And it has comments explaining why barrier(), a compiler barrier, is necessary:

/* Manual memory barriers
 *
 * __atomic_thread_fence does not include a compiler barrier; instead,
 * the barrier is part of __atomic_load/__atomic_store's "volatile-like"
 * semantics. If smp_wmb() is a no-op, absence of the barrier means that
 * the compiler is free to reorder stores on each side of the barrier.
 * Add one here, and similarly in smp_rmb() and smp_read_barrier_depends().
 */

I haven't used __atomic_thread_fence before, but from what I found online, it prevents both the compiler and the CPU from reordering memory accesses. For example, its reference pages here and here don't say it's only a CPU barrier, and an answer here says explicitly that it's both a compiler barrier and a CPU barrier.

Does that mean barrier() in those definitions is redundant? (I'm just curious)


Solution

  • It's redundant for smp_mb: __atomic_thread_fence(__ATOMIC_SEQ_CST) doesn't let any operations reorder across it in either direction. But barrier() does no harm there, so it might as well stay for consistency.

    It's not redundant with RELEASE or ACQUIRE fences. On paper, even ACQ_REL fences allow reordering earlier stores with later loads (StoreLoad). So the compiler is allowed to do that reordering itself at compile time, as well as to omit the instructions that would stop it from happening at run time.

    But the Linux kernel's definitions of smp_rmb() and smp_wmb() are in terms of asm("..." ::: "memory") GNU C inline asm which blocks all compile-time reordering.
    Linux barrier() is defined as asm("" ::: "memory").
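To make the pieces concrete, here is a minimal sketch of the macros from the question built on a Linux-style barrier(). The barrier() definition is an assumption modelled on the kernel's; QEMU's real header may spell it differently. store_then_load() is a hypothetical helper that exists only to exercise the macro.

```c
/* Compiler-only barrier (assumed Linux-style definition): emits no
 * instructions, but the "memory" clobber stops GCC and Clang from moving
 * memory accesses across it at compile time. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* The three macros quoted in the question: a compiler barrier followed by
 * a fence of the requested strength. ({ ... }) is a GNU C statement
 * expression, so this needs GCC or Clang. */
#define smp_mb()         ({ barrier(); __atomic_thread_fence(__ATOMIC_SEQ_CST); })
#define smp_mb_release() ({ barrier(); __atomic_thread_fence(__ATOMIC_RELEASE); })
#define smp_mb_acquire() ({ barrier(); __atomic_thread_fence(__ATOMIC_ACQUIRE); })

/* Hypothetical helper: a store, a full barrier, then a load.
 * StoreLoad is the one ordering that only the SEQ_CST fence provides. */
int store_then_load(int *a, int *b)
{
    *a = 1;
    smp_mb();
    return *b;
}
```

In a single thread the barrier has no visible effect; it only matters when another thread runs the mirror-image sequence on the same two variables.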


    In practice, GCC probably treats any __atomic_thread_fence as a full compiler barrier; see Does gcc treat relaxed atomic operation as a Compiler-fence? - GCC currently won't even optimize increment of the same variable before and after a relaxed operation. But Clang will optimize.

    Practical demo of the difference

    int read_twice(int* x) {
        int tmp = *x;
        //barrier();
        __atomic_thread_fence(__ATOMIC_RELEASE); // Doesn't block LoadLoad
        tmp += *x;
        return tmp;
    }
    

    The latest GCC loads twice, even without barrier().
    Clang correctly optimizes it to a single load without barrier(), but can't when barrier() is present. (Godbolt)

    # x86-64 clang 19, NO barrier()
    read_twice(int*):
            mov     eax, dword ptr [rdi]
            add     eax, eax
            ret
    
    # x86-64 clang 19, WITH barrier()
    read_twice_barrier(int*):
            mov     eax, dword ptr [rdi]
            add     eax, dword ptr [rdi]
            ret
    

    Obviously this is a silly example where the barrier makes no sense, but keep in mind that optimizations are possible after inlining small functions.
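For anyone who wants to reproduce the comparison, both variants can be written out as separate functions, as a sketch. The name read_twice_barrier matches the label in the asm listing above. barrier() is again assumed to be the Linux-style empty asm with a "memory" clobber.

```c
/* Assumed Linux-style compiler barrier. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Release fence only: nothing here orders the two loads against each
 * other, so a compiler may legally merge them into one. */
int read_twice(int *x)
{
    int tmp = *x;
    __atomic_thread_fence(__ATOMIC_RELEASE);
    tmp += *x;
    return tmp;
}

/* Same code plus barrier(): the "memory" clobber forces the compiler to
 * assume *x may have changed, so the second load must be emitted. */
int read_twice_barrier(int *x)
{
    int tmp = *x;
    barrier();
    __atomic_thread_fence(__ATOMIC_RELEASE);
    tmp += *x;
    return tmp;
}
```

Called from a single thread, both return the same value; the difference shows up only in the generated code, e.g. when compiled with -O2.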

    Code that would break without barrier() is probably already unsafe, e.g. using non-atomic (and non-volatile) accesses to shared variables without synchronization. In code that uses fences properly (and/or atomic loads with appropriate memory orders), optimizations allowed without barrier() will still be safe.
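As an illustration of what "using fences properly" can look like, here is a minimal message-passing sketch. The names payload, ready, publish() and consume() are hypothetical. The flag is accessed only through __atomic builtins, and the fences order the plain payload access around it, so no extra barrier() is needed for correctness.

```c
static int payload;   /* plain data, published through the flag */
static int ready;     /* flag, accessed only with __atomic builtins */

/* Writer side: store the data, then a release fence, then set the flag.
 * The fence keeps the payload store from moving after the flag store. */
void publish(int value)
{
    payload = value;
    __atomic_thread_fence(__ATOMIC_RELEASE);
    __atomic_store_n(&ready, 1, __ATOMIC_RELAXED);
}

/* Reader side: spin until the flag is seen, then an acquire fence, then
 * read the data. The fence keeps the payload load from moving before
 * the flag load that observed 1. */
int consume(void)
{
    while (!__atomic_load_n(&ready, __ATOMIC_RELAXED))
        ;
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
    return payload;
}
```

In real use publish() and consume() would run on different threads. Because the flag load is atomic, the compiler cannot hoist it out of the spin loop, which is the kind of guarantee a plain int flag would not give.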

    See also Who's afraid of a big bad optimizing compiler? re: the perils of plain accesses to shared data: as well as the obvious pitfalls, there can be subtle effects like invented loads where a temporary is optimized away and the compiler reloads the shared data.
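The invented-load hazard can be sketched like this. shared_limit, clamp_plain() and clamp_once() are hypothetical names, and the relaxed atomic load stands in for what the kernel would write as READ_ONCE().

```c
/* Hypothetical variable that another thread may update at any time. */
static int shared_limit = 10;

/* Plain access: the compiler is allowed to drop the temporary and read
 * shared_limit again, so the value compared and the value returned could
 * differ if another thread writes in between. */
int clamp_plain(int v)
{
    int limit = shared_limit;
    if (v > limit)
        return limit;
    return v;
}

/* Relaxed atomic load: the compiler must not invent extra loads of it,
 * so the comparison and the return value use the same snapshot. */
int clamp_once(int v)
{
    int limit = __atomic_load_n(&shared_limit, __ATOMIC_RELAXED);
    if (v > limit)
        return limit;
    return v;
}
```

Single-threaded, the two functions behave identically; the difference is only in what the compiler is permitted to do when other threads write shared_limit concurrently.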

    But anyway, for full belt-and-suspenders strict compatibility with the Linux kernel smp_* memory barrier functions, blocking all compile-time reordering across them is correct.


    Related: