I have a system with 3 CPUs sharing memory and a bus. There is no cache in the system; there are, however, store buffers. The CPUs have an atomic Compare And Swap (CAS) instruction available. Additionally, there is a LoadMemoryBarrier() instruction that ensures loads cannot be reordered across the barrier, a StoreMemoryBarrier() that ensures no stores can be reordered across the barrier, and a GeneralMemoryBarrier() that ensures no memory access can be reordered across the barrier.
CPUs can reorder stores against stores or loads, loads against stores or loads, and either against atomic operations. Basically, they can reorder memory accesses any way they please.
unsigned int spinlock = 0;
int x = 0;

void lock()
{
    while (CAS(&spinlock, 1, 0) == 1); // Wait to acquire lock
    // Spinlock acquired
    LoadMemoryBarrier();
}

void unlock()
{
    GeneralMemoryBarrier();
    spinlock = 0;
}

void CPU1()
{
    lock();
    x += 1;
    unlock();
}

void CPU2()
{
    lock();
    x -= 1;
    unlock();
}

void CPU3()
{
    printf("%d", x);
}
Can the store to x be ordered to happen before the spinlock is acquired? My understanding is that no, it cannot be, and so I do not see any case where we would need more than LoadMemoryBarrier() inside the lock() function. Is my understanding correct, and can you provide an example of where this barrier is not enough of a guarantee?
EDIT:
I am using C99 and do not have access to stdatomic. Assume CAS does not include any implicit barriers.
So your C is basically just pseudo-code for asm for this ISA, and we need to assume no compile-time reordering. (Or that these special barrier functions allow as much compile-time reordering as they allow at run-time?)
Your memory model is pretty similar to 32-bit ARM: anything can reorder, and the choice of non-full barriers includes LoadLoad and StoreStore, but no way to block LoadStore reordering with anything weaker than a full barrier (which is expensive because it has to drain the store buffer before any later loads or stores, to block StoreLoad reordering).
In the general case, that means your lock() function needs to upgrade its barrier to GeneralMemoryBarrier() to get acquire semantics; blocking only LoadLoad is not sufficient. You also need to block LoadStore - see https://preshing.com/20120913/acquire-and-release-semantics/. If a thread did lock(); x = 3; unlock();, the store could become visible before this thread acquires the lock. Another thread could see x's value change while it still owned the lock. This is easy to see if we consider compile-time reordering: the store moves up past the LoadLoad barrier, and up past the CAS loop.
It might not be plausible for a real CPU to actually reorder a store ahead of an atomic RMW retry loop; it can't commit stores to memory (or coherent cache if you had any) unless they're known to be non-speculative. But just considering the memory-ordering rules that a CAS can reorder in either direction with a store, this is allowed.
Your unlock() is fine if this were asm. Since you keep claiming it's real C99, an assignment to a non-volatile variable can be problematic. Probably fine in this case, but see Who's afraid of a big bad optimizing compiler? for a lot of the many well-known and more subtle things that can go wrong when relying on just barriers without volatile to force the compiler to emit instructions that access memory.
x += 1 in the critical section
In this case the store side of the x += 1 is dependent on a load. Even with something like value-prediction to make a speculative result ready before the load is allowed to happen, a CPU can't let mis-speculated values become visible to other cores (via memory), or they'd also have to get rolled back on mis-speculation.
Real-world CPUs basically have no choice but to do dependency ordering, that a store whose value depends on an earlier load can't become visible before that load produces a value. (DEC Alpha famously violated causality with a load whose address depended on an earlier load, via some weird cache partitioning effect on a few models.)
If we can rely on the data dependency to ensure that the store half of x += 1 isn't visible until after the load half, a LoadLoad barrier after CAS is sufficient to contain this critical section.
CAS spinlock (load and store)
LoadLoad barrier
load x          // can't reorder earlier than either barrier
store x         // can't happen earlier than load x
Full barrier
store spinlock
The store half of the CAS (atomic RMW) can reorder later, but it's still an atomic RMW. The only necessary thing is to keep the critical section ops after the load side of the atomic RMW that takes the lock. (In C11, taking a mutex has only acquire semantics.)