Tags: assembly, concurrency, x86, arm, undefined-behavior

Are data races in assembly dangerous?


I know that a data race in C for example is undefined behaviour. But are data races an issue at the hardware level?

If I were to write a program in assembly where one thread writes to a certain address while another continuously reads it using simple mov instructions, could the read cause any issues beyond reading garbage? Is it a problem if the read and write are different sizes, but overlap? Is it any different between x86 and ARM?

The reason I am not concerned about reading garbage is that the section in question would be guarded by a seqlock and the result of a conflicting read discarded. According to dl.acm.org/doi/abs/10.1145/2247684.2247688 atomics are still necessary in C++ due to data races being UB, but I think this could be avoided in assembly.


Solution

  • Correct, in assembly language on modern architectures, the only possible "bad" result from unsynchronized loads and stores is that you might read an "incorrect" value (i.e. the load may return a value that was never stored).

    One way to think about this is that the architectural effects of a load instruction only involve writing the destination register, or faulting in case of an illegal address (protection violation, page fault, etc). Faulting is a function only of the address and the contents of page tables, etc, and so isn't affected by concurrent access. Thus if you don't care about the final contents of the destination register (as in the case of a seqlock) there are no other risks.
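For reference, here is a minimal seqlock sketch in portable C++ (names and structure are my own, not from the question). The cited paper's point is that in C++ even the payload accesses must be atomic to avoid UB; in assembly, plain loads and stores would suffice for the payload, since a torn read is detected by the sequence check and discarded.

```cpp
#include <atomic>
#include <cstdint>

// Minimal seqlock sketch. The writer bumps the sequence to an odd value
// before writing and back to even after; a reader retries if the sequence
// was odd or changed during its read, discarding any torn value.
struct SeqLock {
    std::atomic<uint32_t> seq{0};
    std::atomic<uint64_t> data{0};  // payload; atomic in C++ to avoid UB

    void write(uint64_t v) {
        uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);      // odd: write in progress
        std::atomic_thread_fence(std::memory_order_release);
        data.store(v, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);      // even: write complete
    }

    uint64_t read() {
        for (;;) {
            uint32_t s1 = seq.load(std::memory_order_acquire);
            uint64_t v = data.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire);
            uint32_t s2 = seq.load(std::memory_order_relaxed);
            if (s1 == s2 && (s1 & 1) == 0) return v;      // consistent snapshot
        }
    }
};
```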

    In fact, a general rule is that ordinary loads and stores of properly aligned objects of machine-word size or smaller are automatically atomic, essentially having the semantics of C++ memory_order_relaxed or better. (On x86 they even have acquire/release semantics.) So you're guaranteed to read a value that was actually stored, and loads and stores of the same object will observe each other in a manner consistent with program order. For example, if thread 1 does str [x], 3 / str [x], 4 and thread 2 does ldr r0, [x] / ldr r1, [x] then it is not possible to end up with r0 == 4 && r1 == 3. (Note that weakly ordered architectures like ARM make no such promise for accesses to different objects.)
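The same-object coherence guarantee above can be expressed in C++ with relaxed atomics (a sketch; the function names are my own). Even with memory_order_relaxed, two loads of the same atomic object can never observe its stores in an order contradicting program order:

```cpp
#include <atomic>
#include <thread>

// Per-object coherence: relaxed loads of the same atomic never observe
// its stores out of program order.
std::atomic<int> x{0};

void writer() {
    x.store(3, std::memory_order_relaxed);
    x.store(4, std::memory_order_relaxed);
}

// Returns true iff the forbidden outcome (first load sees 4, second sees 3)
// was observed -- which coherence guarantees can never happen.
bool reader_saw_forbidden_order() {
    int r0 = x.load(std::memory_order_relaxed);
    int r1 = x.load(std::memory_order_relaxed);
    return r0 == 4 && r1 == 3;
}
```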

    (You probably know this, but just to make sure future readers are not misled: on x86, non-locked read-modify-write instructions like add [x], 1 perform an atomic load and an atomic store, analogous to x.store(x.load(memory_order_relaxed)+1, memory_order_relaxed), but not atomically with each other; a store from another thread could occur between the two. To get an atomic read-modify-write, analogous to x.fetch_add(1, memory_order_relaxed), you need lock add [x], 1. An exception is xchg mem, reg which always has an implicit lock.)
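The distinction can be sketched in C++ terms (helper names are my own): `unsafe_inc` mirrors a non-locked `add [x], 1` — an atomic load and an atomic store that are not atomic together, so concurrent increments can be lost — while `safe_inc` mirrors `lock add [x], 1`, a single atomic read-modify-write:

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> counter{0};

// Analogue of non-locked `add [x], 1`: load and store are each atomic,
// but another thread's store can land between them, losing an increment.
void unsafe_inc() {
    counter.store(counter.load(std::memory_order_relaxed) + 1,
                  std::memory_order_relaxed);
}

// Analogue of `lock add [x], 1`: one atomic read-modify-write.
void safe_inc() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// Run `nthreads` threads, each calling `inc` `iters` times.
long run_threads(void (*inc)(), int nthreads, int iters) {
    counter.store(0);
    std::vector<std::thread> ts;
    for (int i = 0; i < nthreads; ++i)
        ts.emplace_back([inc, iters] {
            for (int j = 0; j < iters; ++j) inc();
        });
    for (auto& t : ts) t.join();
    return counter.load();
}
```

With `safe_inc`, the final count is always exactly threads × iterations; with `unsafe_inc`, it may come up short under contention.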

    I think that, at least on ARM and x86, overlapping accesses that are aligned, but of different sizes, are also guaranteed to behave atomically with respect to each other, "as expected".

    For misaligned accesses, things are harder, and you have to refer to the fine print of the architecture spec. AFAIK base ARMv8-A guarantees nothing except that each byte of the misaligned object is loaded/stored atomically. ARMv8.4 with FEAT_LSE2 promises atomicity as long as you do not cross a 16-byte boundary. x86 CPUs since P6 guarantee atomicity as long as you do not cross a cache line boundary. Still, in all cases, the worst-case scenario is that you load a "torn" value that does not match anything that was ever stored.
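A small helper (my own, for illustration) makes the boundary condition concrete: an access of `size` bytes at `addr` risks tearing exactly when it straddles the relevant boundary — 16 bytes for FEAT_LSE2, or the cache line size (typically 64 bytes) for x86 since P6:

```cpp
#include <cstdint>
#include <cstddef>

// True iff a `size`-byte access starting at `addr` crosses a
// `boundary`-byte boundary -- the condition under which FEAT_LSE2 (16)
// or x86-since-P6 (cache line, typically 64) no longer guarantee
// single-copy atomicity for a misaligned access.
bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary) {
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```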