
LLVM - Atomic Ordering Unordered


I'm working on a library that relies heavily on bitwise operations, where the most important ones operate on shared memory. I was also having a look at LLVM's atomic ordering documentation and noticed unordered, which seems to be even weaker than C/C++'s relaxed memory order. I have several questions about it:

  • What are the differences between unordered and relaxed?
  • Say I have an atomic bool, is it safe to mutate it via unordered load/store?
  • Say I have an atomic bitmask, is it safe to mutate it via unordered load/store?
  • Is it safe to mutate it via unordered fetch_and/or/xor?
  • Is it safe to mutate it via unordered swap?
  • Is it safe to mutate it via unordered compare_and_swap?

Solution

  • The short answer is: whatever your problem is, unordered is screamingly unlikely to be the solution !

    The longer answer is...


    ...the LLVM Language Reference Manual says:

    unordered

    The set of values that can be read is governed by the happens-before partial order. A value cannot be read unless some operation wrote it. This is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables. This ordering cannot be specified for read-modify-write operations; it is not strong enough to make them atomic in any interesting way.

    The "A value cannot be read unless some operation wrote it." is fun ! What this means is that “speculative” writes are not allowed. Say you have if (y == 99) x = 0 ; else x = y+1 ;: an optimizer could turn that into x = y+1 ; if (y == 99) x = 0 ; where the first write of x is the “speculative” one. (I'm not saying that's a sensible or common optimization. The point is that transformations which are perfectly OK from the perspective of a single thread, are not OK for atomics.) The C/C++ standards have the same restriction: no “out-of-thin-air” values are allowed.

    Elsewhere the LLVM documentation describes unordered as loads/stores which complete without interruption from any other store -- so a load which reads two halves (say) of a value would not qualify if the two halves could be the result of two separate writes !

    It seems monotonic is the local name for C/C++ memory_order_relaxed, and is described:

    monotonic

    In addition to the guarantees of unordered, there is a single total order for modifications by monotonic operations on each address. All modification orders must be compatible with the happens-before order.

    Unlike unordered, with monotonic all threads will see writes to a given address in the same order. This means that if thread 'a' writes '1' to a given location, and afterwards thread 'b' writes '2', then after that threads 'c' and 'd' must both read '2'. (Hence the name.)

    There is no guarantee that the modification orders can be combined to a global total order for the whole program (and this often will not be possible).

    This is the relaxed bit.
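    As a concrete (if simplified) illustration in C++, using memory_order_relaxed for LLVM's monotonic -- here one thread performs both stores, so that "afterwards" is guaranteed; the names are invented for the sketch:

        #include <atomic>
        #include <cassert>
        #include <thread>

        std::atomic<int> loc{0};

        void writes() {
            loc.store(1, std::memory_order_relaxed);   // '1' is written first...
            loc.store(2, std::memory_order_relaxed);   // ...then '2', afterwards
        }

        void read_after() {
            // Both readers run after the writes (via join), so both must read 2:
            // reading the older value 1 would contradict the modification order.
            assert(loc.load(std::memory_order_relaxed) == 2);
        }

        int main() {
            std::thread w(writes);
            w.join();                                  // "after that"...
            std::thread c(read_after), d(read_after);
            c.join();
            d.join();
        }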

    The read in an atomic read-modify-write operation (cmpxchg and atomicrmw) reads the value in the modification order immediately before the value it writes.

    Same as in C/C++: a read-modify-write cannot be interrupted by another write.
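    That guarantee is what makes even a relaxed counter work. A sketch, using C++ fetch_add with memory_order_relaxed:

        #include <atomic>
        #include <cassert>
        #include <thread>

        // Because each read-modify-write reads the value immediately before the
        // one it writes, no increment is ever lost -- even though relaxed imposes
        // no ordering on anything else.
        std::atomic<int> counter{0};

        void bump() {
            for (int i = 0; i < 1000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        }

        int main() {
            std::thread t1(bump), t2(bump);
            t1.join();
            t2.join();
            assert(counter.load(std::memory_order_relaxed) == 2000);  // always holds
        }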

    If one atomic read happens before another atomic read of the same address, the later read must see the same value or a later value in the address’s modification order. This disallows reordering of monotonic (or stronger) operations on the same address.

    It's monotonic, guys.

    If an address is written monotonic-ally by one thread, and other threads monotonic-ally read that address repeatedly, the other threads must eventually see the write.

    So... there may be some latency between a write and reads, but that latency is finite (but may, I assume, not be the same for all threads). The "eventually" is interesting. The C11 standard says "Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.", which is similar but with a more positive spin :-)

    This corresponds to the C++0x/C1x memory_order_relaxed.

    So there you go.
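    To make the "eventually" concrete, here is a sketch in C++: the waiter keeps re-reading the flag, so the relaxed store must eventually become visible and the program terminates (relaxed publishes nothing else, of course):

        #include <atomic>
        #include <thread>

        std::atomic<bool> done{false};

        int main() {
            std::thread waiter([] {
                // Reads the address repeatedly, so must eventually see the write.
                while (!done.load(std::memory_order_relaxed)) { /* spin */ }
            });
            done.store(true, std::memory_order_relaxed);
            waiter.join();
        }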


    You asked:

    • Say I have an atomic bool, is it safe to mutate it via unordered load/store?
    • Say I have an atomic bitmask, is it safe to mutate it via unordered load/store?

    It rather depends on what you mean by safe :-( With any atomic load of 'x' followed later by an atomic store of 'x', you have no idea how many other stores to 'x' there have been since the load. But, the added joy of unordered is that it does not guarantee that all threads will see all writes to 'x' in the same order !
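    For the bitmask case specifically, here is a sketch of why load-then-store is not enough (the helper names are made up):

        #include <atomic>
        #include <cstdint>

        std::atomic<uint32_t> mask{0};

        // Racy: the load and the store are each atomic, but another thread can
        // store to 'mask' between them and its bits are then overwritten.  No
        // choice of ordering (unordered, relaxed, even seq_cst) fixes this --
        // ordering does not turn load+store into one atomic step.
        void set_bit_racy(unsigned bit) {
            uint32_t m = mask.load(std::memory_order_relaxed);
            mask.store(m | (1u << bit), std::memory_order_relaxed);
        }

        // Correct: the whole update is one atomic read-modify-write.
        void set_bit(unsigned bit) {
            mask.fetch_or(1u << bit, std::memory_order_relaxed);
        }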

    Your other questions are moot, because you cannot have unordered read-modify-write operations.

    As a practical matter, your x86_64 guarantees that all writes are visible to all threads in the same order -- so all atomic operations are at the very least monotonic/memory_order_relaxed -- no LOCK prefixes and no xFENCE instructions are required; a plain MOV to memory will do the trick. In fact, better than that, plain MOV to/from memory gives you memory_order_release/memory_order_acquire.
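    For example, this release/acquire publication (a sketch; the names are made up) needs no LOCK prefix or fence on x86_64 -- both the store and the load compile to plain MOVs, though the orderings still constrain the compiler:

        #include <atomic>

        int payload = 0;
        std::atomic<bool> ready{false};

        void producer() {
            payload = 42;
            ready.store(true, std::memory_order_release);      // plain MOV on x86_64
        }

        int consumer() {
            while (!ready.load(std::memory_order_acquire)) { }  // plain MOV on x86_64
            return payload;                                      // guaranteed to be 42
        }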


    FWIW: you mention bit-wise operations. I guess it's obvious that reading/writing a few bits is going to involve reading/writing some number of other bits as a side-effect. Which adds to the fun.

    My guess is that at a minimum you will need to be doing read-modify-write operations. Again, on x86_64 this boils down to an instruction with a LOCK prefix, which is going to cost 10's of clocks -- how many 10's depends on the CPU and the degree of contention. Now, a mutex lock and unlock will each involve a LOCKed instruction, so replacing a mutex_lock/...do some reading and writing.../mutex_unlock sequence by atomic read-modify-writes is generally worthwhile only if it boils down to a single read-modify-write.
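    As a sketch of that trade-off (names invented here): the lock-free version is one LOCKed instruction, while the mutex version pays for the lock and the unlock, so it only earns its keep when the critical section does more than a single read-modify-write:

        #include <atomic>
        #include <cstdint>
        #include <mutex>

        std::atomic<uint32_t> atomic_mask{0};

        std::mutex mtx;
        uint32_t plain_mask = 0;

        // One LOCKed instruction on x86_64.
        void set_bit_atomic(unsigned bit) {
            atomic_mask.fetch_or(1u << bit, std::memory_order_relaxed);
        }

        // Lock and unlock each involve a LOCKed operation, so this only pays off
        // when the critical section does several reads and writes.
        void set_bit_mutex(unsigned bit) {
            std::lock_guard<std::mutex> guard(mtx);
            plain_mask |= 1u << bit;
        }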

    The disadvantage of a mutex is, of course, that a thread can be "swapped out" while it holds the mutex :-(

    A spin-lock (on x86_64) requires a LOCK to acquire but not to release... but the effect of being "swapped out" while holding a spin-lock is even worse :-(
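    A minimal spin-lock sketch in C++ shows that shape: acquiring is an atomic read-modify-write (an implicitly LOCKed XCHG on x86_64), releasing is just a store:

        #include <atomic>

        class SpinLock {
            std::atomic<bool> locked{false};
        public:
            void lock() {
                // Atomic read-modify-write: a LOCKed instruction on x86_64.
                while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
            }
            void unlock() {
                // Just a store: a plain MOV with release semantics, no LOCK prefix.
                locked.store(false, std::memory_order_release);
            }
        };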