I need to build a lock-free stack implementation. I read this page and I understand the functionality of the listed lock-free push operation.
Now, I have to build a similar version of the pop operation. This is what I've done until now but I think, there are some concurrency problems:
template <class T>
bool CASStack<T>::pop(T& ret) {
node<T>* old_head = head.load(std::memory_order_relaxed);
if(old_head == nullptr) {
return false;
}
// from here on we can assume that there is an element to pop
node<T>* new_head;
do {
new_head = old_head->next;
} while(!head.compare_exchange_weak(old_head, new_head, std::memory_order_acquire, std::memory_order_relaxed));
ret = old_head->data;
return true;
}
I think I also will get trouble if I delete old_head after the swap, right?
EDIT: Updated question!
Your node<T>* new_head = old_head->next;
is a red herring; you never use this variable.
In my comments suggesting you needed to put it inside a do{}while(!CAS)
loop, I was thinking you were doing head.CAS(old_head, new_head)
. This would have the problems I was talking about, of putting a possibly-stale pointer into the list if the CAS did have to retry.
But you're actually doing head.CAS(old_head, old_head->next)
which generates the "desired" value from the updated old_head
every time through the loop. This is actually correct, but hard to follow, so I'd suggest using a do{}while()
like so:
// FIXME: this may suffer from ABA problems; see other answers.
node<T>* pop(std::atomic<node<T>*> &head)
{
// We technically need acquire (or consume) loads of head because we dereference it.
node<T>* old_head = head.load(std::memory_order_acquire);
node<T>* new_head;
do {
if(old_head == nullptr) {
// need to re-check because every retry reloads old_head
// pop in another thread might have emptied the list
return nullptr;
}
new_head = old_head->next;
// if head still equals old_head this implies the same relation for new_head
} while(!head.compare_exchange_weak(old_head, new_head,
std::memory_order_acquire));
// Note the ordering change: acquire for both success and failure
return old_head; // defer deletion until some later time
}
(To deal with possible ABA problems, a pointer+sequence-number struct may be needed. Doing that in a way that still allows efficient loads of just the pointer can be pretty hacky: How can I implement ABA counter with c++11 CAS?. See also other answers on this question which address the ABA problem; I wrote this answer a while ago and don't guarantee that the whole thing adds up to a usable lock-free stack!)
Is it allowed to do
old_head->next
within compare_exchange_weak? Is this still atomic?
The CAS is still atomic. Any compare_exchange_weak
that compiles is itself atomic. The compiler evaluates the args before the function call, though, so reading old_head->next
isn't part of the atomic transaction that CAS does. It's already been read separately into a temporary. (Doing this explicitly with a separate variable like in the do{}while
loop is common.)
If node::next
is an atomic<>
member of node
, you should think about what memory order you want to use for that load. But for a pure stack, it doesn't have to be atomic, because linked-list nodes are never modified while they're on the stack, only before being pushed with the right next
pointer. Shared read-only access is not a race.
Usage as a pure stack also reduces deletion problems: threads can't "peek" at the head node or traverse the list. They can only look inside a node after popping it, and the pop
algorithm ensures they have exclusive ownership of the node (and are responsible for deleting it).
But pop()
itself needs to load from the head
node. If another thread races with us and returns the memory for that head
to the OS, we could fault. So we do have a deletion problem like RCU does, like I mentioned in a comment.
Simply reusing the memory for something else wouldn't be a problem on most C++ implementations, though: we would read a garbage value for old_head->next
, but CAS would fail (because the head
pointer must have changed before the old head object was freed) so we'd never do anything with the bogus value we loaded. But it's still C++ UB for our atomic load to race with a non-atomic store. But a compiler would have to prove that this race actually does happen before it's allowed to emit anything other than normal asm, and all mainstream CPUs don't have any problem with such a race in asm.
But unless you can guarantee that free()
or delete
just put the memory on a free list, i.e. that they don't munmap
it between a load of head
and a deref of old_head->next
, the above reasoning doesn't make it safe for the caller to delete pop
's return value right away. It only means problems are very unlikely (and hard to detect with simple testing).
We load head
and then expect that pointer to point to useful values. (i.e. old_head->next
). This is exactly what memory_order_consume
gives us. But it's hard to use, and so hard to optimize that compilers just strengthen it to acquire
, which makes it impossible to test code that uses consume
. So we really want acquire
for all our loads of head
.
(Using consume
with current compilers is equivalent to acquire
. If you actually need the performance of data dependency ordering with no barriers, see C++11: the difference between memory_order_relaxed and memory_order_consume for how to try to safely use relaxed
.)
Note that getting the value out of the node we pop also depends on memory ordering, but I think if we didn't need old_head->next
we could use relaxed everywhere but in the success
side of the CAS (where we would need at least consume
, so in practice acquire
).
(On mainstream C++ implementations we could probably get away with relaxed
on all architectures except DEC Alpha AXP, the famously weakly-ordered RISC from the 90s. The compiler will almost certainly create code with data dependencies on the loaded pointer, because it doesn't have any other way to access the values it needs. And all "normal" hardware except Alpha provides the mo_consume
style dependency ordering for free. So testing with relaxed
would never show problems unless you had one of the rare models of Alpha that actually could produce this reordering in hardware, and a working C++11 implementation for it. But it's still "wrong", and could potentially break with compile-time reordering, or maybe I'm missing something and relaxed
might actually break in practice without inlining into something more complex + constant-propagation.)
Note that these mo_acquire
loads synchronize-with the mo_release
store in the thread that pushed the object pointed to by the current head
. This prevents our non-atomic loads from old_head
from racing with non-atomic stores to the node in the thread that pushed it.