c++concurrency linked-list atomic lock-free

Is this lock-free dlist insertion is safe?

I need to implement a lock-free insertion of a sub-list in the head of a doubly-linked list. This list has a dummy head, so each thread tries to insert its part right after the head node. This design seems OK for me, however, I don't have enough expertise to prove it.

struct Node {
  std::atomic<Node*> next;
  std::atomic<Node*> prev;
};
Node head;

// concurrently insert `first`..`last` sublist after `head`
void insertSublist(Node* first, Node* last) {
  first->prev = &head;
  Node* current_next = head.next;

  while (true) {
    last->next = current_next;
    if (head.next.compare_exchange_weak(current_next, first)) {
      current_next->prev = last;
      return;
    }
  }
}

I need to validate this design under these circumstances:

Variant 1

No list removal is performed, all threads just do insertion in a loop.

Variant 2

There is 1 thread, that removes nodes from the list in random order, but it will never remove a node that is right after the head node.

for (auto* node : nodes_to_be_removed) {
  if (node->prev == &head)
    continue;
  // perform removal 
}

When insertion is done, node->prev is the last link that is changed. So after it is changed, no other thread (except remover) can access the node or its preceding node next link. Is this reasoning valid or I am missing something?

Some clarifications after @peter-cordes answer.

The list is not linearly traversable, so inconsistent list state is not an issue from this point of view.
If you remove a node that an inserter was going to modify (to add the backward link) but hasn't yet

I expect that the check node->prev == &head prevents this case. Is it true?
Are removals safe under these conditions?
- There is only 1 remover thread
- Remover has a separate worklist of nodes for deletion
- A node can be added to the worklist only after its insertion stage is completely finished

Solution

TL:DR: inserts alone are sort of ok depending on what readers do (no long-term corruption), but removals are probably impossible without locking or much more sophistication, and definitely a showstopper for this simple insert algorithm.

Update: @davidhigh has identified (in comments) a potential problem for the ->prev links, where some nodes might be missing from the backwards traversal order even after the dust settles from some concurrent inserts.

Otherwise, if another writer inserts a sublist between the CAS and the store to current_next -> prev, this new sublist is lost forever in the backwards traversal, isn't it?

Doing that with a CAS (against a value read at some earlier point...?) might detect this problem, and in the failure case we could maybe traverse the list forward looking for last? It's non-trivial and I don't have an answer.

It's a double-linked list, so an insert unavoidably requires modifying two memory locations that other threads can already see: head.next, and the .prev pointer in the old first node. This can't be done atomically+locklessly unless the hardware has a DCAS (double-CAS, two separate non-contiguous locations at once). As the wikipedia article says, it makes lockless double-linked lists easy.

m68k had DCAS at one point, but no current mainstream CPU architecture has it. ISO C++11 doesn't expose a DCAS operation through std::atomic because you can't emulate it on HW that doesn't have it without making all atomic<T> non-lock-free. Except on hardware with transactional memory, available on some recent x86 CPUs from Intel (e.g. Broadwell and later), but not AMD. There has been some work on adding syntax for TM to C++, see https://en.cppreference.com/w/cpp/language/transactional_memory

Of course, it's also impossible for an observe to observe two locations at once, atomically, without transactional memory or something like DCAS. So any threads that read the list need to expect it to be changing out from under them, especially if the list is also supposed to support removals.

Setting up the pointers within the new nodes (not yet published to other threads) before publishing is obviously good, and you're doing that. first->prev and last->next are both set correctly before a CAS attempt to publish these new nodes. The CAS has sequential-consistency memory ordering, so it makes sure those previous stores are visible to other threads before it itself is. (So those "private" stores might as well be std::memory_order_relaxed for efficiency).

Your choice of modifying the old-first's .prev pointer after modifying head makes a lot of sense. You're essentially publishing first in the forward direction, then in the reverse direction. But remember that it's possible for a thread to sleep for a long time at any point, so it's not 100% safe to assume this will always be a momentary inconsistency. Imagine stopping one thread in the debugger at any point inside this function, and even single-stepping, while other threads run. In this case there are only 2 interesting operations, the CAS and the unconditional store to the old first non-dummy node.

If a thread was traversing forward and depending on being able to go back by following a .prev (instead of remembering its own previous in a local variable), this might look to it like the new nodes had been deleted again. It can find a .prev pointing to head. This is a contrived example because it would normally be more efficient to simply remember your previous node if you want to be able to find it again, especially in a lockless list. But perhaps there are non-contrived cases, like maybe one thread traversing forward and another traversing backward, and maybe interacting directly or indirectly, where an inconsistency would be visible.

As long as all threads agree on which order to modify things, I think insertion itself is safe at least for the forward links. Doing it only at the head makes it easier to verify, but I think allowing arbitrary insertion points is still safe.

Your current code looks safe for simultaneous insertions (assuming no removals). The forward list can be longer than the backward list (with potentially multiple insertions into the backward list outstanding), but once they all complete ~~the list will be consistent~~.

Correction, the forward list will be consistent, but the backward list might lose track of a node or sublist.

Without removal, each pending write to a .prev has a valid destination, and that destination is a node that no other thread wants to write to. Lockless singly-linked list insertion is easy, and the forward list (the .next links) are always up to date.

So each insert operation "claims" a node that it uses as the insertion point into the reverse list, where its store to current_next->prev becomes visible.

Or does it? I think multiple insertions at the same point can succeed their forward CAS before either has done their backward (->prev store, so we're not claiming exclusive ownership of a node after all.

A do{}while(!CAS()) loop is a nice idiom, usually simplifying the code. I weakened the memory-ordering of the other operations, especially the private ones on first and last, because requiring the compiler to use slow barriers after stores to elements other threads can't see yet is inefficient. On x86, release-stores are "free" (no extra barriers), while seq-cst stores are a lost more expensive. (A seq-cst store on x86 has about the same cost as an atomic read-modify-write, in the uncontended case).

// no change in logic to yours, just minimize memory ordering
// and simplify the loop structure.
void insertSublist(Node* first, Node* last)
{
  first->prev.store(&head, std::memory_order_relaxed);
  Node* current_next = head.next.load(std::memory_order_relaxed);

  do {
    // current_next set ahead of first iter, and updated on CAS failure
    last->next.store(current_next, std::memory_order_relaxed);
  }while (!head.next.compare_exchange_weak(current_next, first));
   // acq_rel CAS should work, but leave it as seq_cst just to be sure.  No slower on x86

  current_next->prev.store(last, std::memory_order_release);  // not visible before CAS
}

This compiles for x86 with zero mfence instructions instead of 3 for yours, on the Godbolt compiler explorer. (The rest of the asm is literally identical, including the one lock cmpxchg.) So in the uncontended no-RFO case (e.g. repeated insertions from the same core), it might be something like 4x faster. Or better because mfence is actually even slower than a lock prefix on Intel CPUs.

Plus, the do{}while(!CAS) with the final store fully outside the loop is arguably easier for humans to read and see the logic right away.

Removals are a massive complication

I don't see how you could safely remove while you have pending insertions. If you remove a node that an inserter was going to modify (to add the backward link) but hasn't yet, that range of nodes will be forever missing from the backward list.

(Plus if you recycle the memory for that Node, the store by the inserter then steps on something.)

It will make the forward and backward lists out of sync. I don't see a way to solve this without DCAS (or transactional memory, which is a superset of DCAS). I'm not a lockless dlist expert, though, so maybe there's a trick.

Probably even multiple simultaneous removers is a problem, because you can end up with pending modifications to a node that another thread is going to (or already has) removed. Or multiple pending modifications to one node, with no way to make sure that the right one finishes last.

If you had an inserters/removers lock (multiple inserters / single remover, exactly like a readers/writer lock), you could make sure there were no pending inserts when doing a removal. But still allow lockless inserts. Maybe put the lock in the same cache line as head, because inserting threads will always need to modify both it and head. Or maybe not because you might just end up with more contention for that line if cores sometimes lose ownership of the line after taking the lock but before committing their modification to head.