Simple lock free stack c++11

I've seen several overly complicated (in my opinion obviously) implementation of a lock free stack in c++ (using tags like here) and I came up with what I believe is a simple and still valid implementation. Since I couldn't find this implementation anywhere (I've seen the Push function implemented similarly to what I did but not the Pop), I'm guessing that it's incorrect in some way (most likely fails the ABA case):

template<typename Data>
struct Element
{
  Data mData;
  Element<Data>* mNext;
};

template<typename Data>
class Stack
{
 public:
  using Obj = Element<Data>;
  std::atomic<Obj*> mHead;

  void Push(Obj *newObj)
  {
    newObj->mNext = mHead.load();
    //Should I be using std::memory_order_acq_rel below??
    while(!mHead.compare_exchange_weak(newObj->mNext, newObj));
  }

  Obj* Pop()
  {
    Obj* old_head = mHead.load();
    while (1)
    {
      if (old_head == nullptr)
        return nullptr;
      //Should I be using std::memory_order_acq_rel below??
      if(mHead.compare_exchange_weak(old_head, old_head->mNext)) ///<<< CL1
        return old_head;
    }
  }
};

I'm assuming that the callers of Push and Pop will take care of memory allocation and deallocation. Another option is to make the above Push and Pop private methods and have new public functions that would take care of memory allocation and call these functions internally. I believe the most tricky part of this implementation is the line I marked with "CL1". The reason I think it's correct and still works in the ABA case is the following:

Lets say the ABA case does happen. Meaning that mHead at "CL1" would be equal to old_head but the object they're both pointing to would actually be different than the one old_head was originally pointing to when I assigned mHead to it. But, I think that even if it's a different object we're still OK, since we know it's a valid "head". old_head is pointing to the same object as mHead is, so it IS the valid head of the stack and this means that old_head->mNext is the valid next head. So, updating mHead to old_head->mNext is still correct!

To summarize:

If mHead != old_head (another thread preempted us and changed mHead) -> old_head is updated to be the new mHead and we start the loop again.
[NON-ABA] If mHead == old_head -> simple case, update mHead to be old_head->next (==mHead->mNext) and return old_head.
[ABA] If mHead == old_head -> works as explained above.

So, is my implementation valid? What am I missing?

Solution

ABA happens when:

Thread A reads old_head->mNext and blocks before calling compare_exchange_weak.
Thread B pops the current node pushes some other nodes, then pushes the original node back onto the stack.
Thread A unblocks, completes compare_exchange_weak successfully since mHead has the same value, but stores the stale mNext value as the new mHead.

See this answer for more details, you have both problem #2 (data race on mNext) and problem #3 (ABA).