For locks (spin locks, mutexes), we usually need acquire ordering when taking the lock so that the lock actually works as a lock.
But is this necessary for one-shot spin locks? For example:
int val = 0;
atomic_int lock = 0;

void thread0(void)
{
    int tmp = 0;
    if (atomic_compare_exchange_strong_explicit(&lock, &tmp, 1, memory_order_relaxed, memory_order_relaxed)) { // do we need memory_order_acquire here?
        assert(!val); // will it always succeed?
        val = 1;
    }
}

// same as thread0
void thread1(void)
{
    int tmp = 0;
    if (atomic_compare_exchange_strong_explicit(&lock, &tmp, 1, memory_order_relaxed, memory_order_relaxed)) {
        assert(!val);
        val = 1;
    }
}
More specifically, is the following code correct on the armv7-a architecture (it may differ slightly from the C code above)?
val:
    .long 0
lock:
    .long 0

core0:
    ldr r0, =val
    ldr r1, =lock
    mov r4, #1
2:
    ldrex r2, [r1]
    cmp r2, #0
    beq 1f
    bx lr // lock already taken: return
1:
    strex r3, r4, [r1]
    cmp r3, #0
    bne 2b // exclusive store failed: retry
    // without acquire fence
    ldr r5, [r0] // is r5 != 0 allowed?

core1:
    ldr r0, =val
    ldr r1, =lock
    mov r4, #1
2:
    ldrex r2, [r1]
    cmp r2, #0
    beq 1f
    bx lr // lock already taken: return
1:
    strex r3, r4, [r1]
    cmp r3, #0
    bne 2b // exclusive store failed: retry
    dmb ish // acquire fence
    str r4, [r0] // store 1
A more specific example (the "do work" and "do the clean" steps must not race with each other):
#define EXIT_FLAG 1
#define WORK_FLAG 2

atomic_int state = 0;

void thread0(void)
{
    int tmp;
    while (1) {
        tmp = 0;
        if (!atomic_compare_exchange_strong_explicit(&state, &tmp, WORK_FLAG, memory_order_relaxed, memory_order_relaxed)) { // do we need acquire here?
            assert(tmp == EXIT_FLAG);
            return;
        }
        // do work
        tmp = WORK_FLAG;
        if (!atomic_compare_exchange_strong_explicit(&state, &tmp, 0, memory_order_release, memory_order_relaxed)) {
            assert(tmp == (EXIT_FLAG | WORK_FLAG));
            // do the clean
            return;
        }
    }
}

void thread1(void)
{
    int tmp = 0;
    while (1) {
        if (atomic_compare_exchange_strong_explicit(&state, &tmp, tmp | EXIT_FLAG, memory_order_acquire, memory_order_relaxed)) // we need acquire here to pair with the release in thread0
            break;
    }
    if (!(tmp & WORK_FLAG)) {
        // do the clean
    }
}
This section is about the first code block, where both threads do
if (lock.CAS_strong(0, 1, relaxed)) { assert(val==0); val=1; }
The assert (in the first code block) always succeeds because only one thread or the other ever runs the if body, and the assert is sequenced before the val=1 in the same thread.
The atomic RMW decides which thread wins the race, but it doesn't need to sync-with any previous writer to prevent their "critical sections" from overlapping. (https://preshing.com/20120913/acquire-and-release-semantics/)
You don't have a critical section. The load of val could have happened before the CAS, but that would still be fine because you'd still load the initial value then. And there's no release store, so nothing lets other threads know that your val update is complete.
I wouldn't call it a one-shot spinlock; that name has potentially misleading implications, like that you'll sync-with the lock variable and that other threads can see when you're done. (BTW, lock needs to be atomic_int aka _Atomic int, not plain int.)
This is a bit like a guard variable for a non-constant initializer of a static variable, which is one-shot for the whole program, although that actually does still need acquire, unfortunately, unless you can separate the case where one thread very recently finished init from cases where we've already synced-with the init. Maybe a third value for the guard variable, plus something like a global all-threads membarrier() system call that runs on all cores / threads.
The new code has one thread running an infinite loop around try_lock() / do work / unlock. The other thread spins on a CAS(acquire) to set another bit that will make the first thread stop, and to sync with its release store.
But there are no shared variables other than atomic_int state, so there's no difference between relaxed and release/acquire. Operations on a single object don't reorder with each other within the same thread, even for relaxed. seq_cst would additionally forbid store-forwarding, where one thread sees its own stores before they're globally visible (to all threads).
The first loop will exit (via one if or the other) on the first failed CAS_strong, i.e. as soon as the second thread succeeds at its CAS to set the EXIT_FLAG bit.
This isn't a locking algorithm: the second thread can succeed at setting EXIT_FLAG while the first thread is inside its "critical section", i.e. while WORK_FLAG is set in state. No strengthening of the memory_order parameter on any of the operations can change this.
Without acquire in the first thread's CAS which "takes the lock" (setting WORK_FLAG), later operations can become visible to other threads before state changes. That's a problem for a normal lock, but this is far enough from being a lock that it's not obvious what exactly do work and do the clean are supposed to be.
Only one thread will run do the clean; either the first or the second thread, depending on when the second thread succeeds at a CAS.