
Why is std::memory_order_relaxed preferred on failure in a CAS loop?


When implementing a CAS loop using std::atomic, cppreference gives the following example for push:

template<typename T>
struct node
{
    T data;
    node* next;
    node(const T& data) : data(data), next(nullptr) {}
};

template<typename T>
class stack
{
    std::atomic<node<T>*> head;
 public:
    void push(const T& data)
    {
      node<T>* new_node = new node<T>(data);
      new_node->next = head.load(std::memory_order_relaxed);

      while(!head.compare_exchange_weak(new_node->next, new_node,
                                        std::memory_order_release,
                                        std::memory_order_relaxed /* Eh? */));
    }
};

Now, I don't understand why std::memory_order_relaxed is used for the failure case. As far as I understand, compare_exchange_weak (the same applies to -strong, but I'll use the weak version for convenience) acts as a load operation on failure. That load may observe a value stored by a successful CAS in another thread with std::memory_order_release, so shouldn't it use std::memory_order_acquire to be synchronized-with that store instead...?

while(!head.compare_exchange_weak(new_node->next, new_node,
                                  std::memory_order_release,
                                  std::memory_order_acquire /* There you go! */));

What if, hypothetically, the relaxed load keeps observing one of the old values, so the CAS fails again and again and the loop spins for extra time?

Here is where my brain is stuck (from a scratchy picture I drew):

Shouldn't a store from T2 be visible at T1 (by having a synchronized-with relation with it)?

So, to sum up my questions:

  • Why std::memory_order_relaxed rather than std::memory_order_acquire on failure?
  • What makes std::memory_order_relaxed sufficient?
  • Does std::memory_order_relaxed on failure mean (potentially) more looping?
  • Likewise, does std::memory_order_acquire on failure mean (potentially) less looping (aside from its performance cost)?

Solution

  • I don't understand how come std::memory_order_relaxed is used for the failure

And I don't understand how you can complain about the lack of acquire semantics on that failure branch, yet not complain about

    head.load(std::memory_order_relaxed);

    and then

    while(!head.compare_exchange_weak(new_node->next, new_node,
                                      std::memory_order_release
    

neither of which is an acquire operation "to be synchronized-with" some other operation that you don't show us. What is that other operation that you care about?

    If that operation is important, show it and tell us how this code depends on the "publication" (or "I'm done" signal) of that other operation.

Answer: the push function in no way depends on the publication of any "I'm done" signal by another function, as push does not use other published data, does not read other pushed elements, etc.

    Why not std::memory_order_acquire, instead of std::memory_order_relaxed at failure?

    To acquire what? In other words, to observe what accomplishment?

    Does std::memory_order_relaxed at failure mean (potentially) more looping?

No. The failure mode has nothing to do with memory visibility; compare_exchange_weak is permitted to fail spuriously as a function of the CPU cache mechanism (e.g. a lost LL/SC reservation), not of the memory ordering you pass.

    EDIT:

    I just saw the text in your image:

    Shouldn't a store from T2 be visible at T1? (by having synchronized-with relation with each other)

Actually, you misunderstood synchronized-with: it is not what propagates the value of the atomic variable being read, as an atomic is by definition a primitive that is safe to access concurrently. A read of an atomic always returns a value of the atomic variable as written by some thread (possibly the same thread). If that were not the case, no atomic operation would be meaningful.

    No memory ordering is ever needed to read a single atomic variable.