
C++ atomic load ordering efficiency


I have a memory variable that is updated in thread A and read in other threads. The readers only care whether the value is non-zero. I am guaranteed that once the value is incremented, it never goes back to zero. Does it make sense to optimize as below? In other words, on the reader side, I don't need a "fence" once my condition is satisfied.

#include <atomic>

std::atomic<int> counter;

writer:
void increment()
{
    counter.store(counter + 1, std::memory_order_release);
}

reader:
bool iszero()
{
    if (counter.load(std::memory_order_relaxed) > 0) return false;
    // memory fence only if condition not yet reached
    return (counter.load(std::memory_order_acquire) == 0);
}

Solution

  • First, if you've not actually tried using the default (sequentially consistent) atomics, measured the performance of your app, profiled it, and observed them causing a performance problem, I'd suggest turning back now.

    However, if you really do need to start reasoning about relaxed atomics...


    That is not guaranteed to do what you expect, although it will almost certainly work on x86.

    I'm guessing that you're using this to guard the publication of some other non-atomic data.

    In that case, you need the guarantee that if you read a non-zero value in the reader thread, then the side effects on non-atomic memory locations that the writer thread made before the store (i.e. initializing the data you're publishing) will be visible to the reader thread.

    Reading non-zero with std::memory_order_relaxed does not synchronize with the std::memory_order_release store, so your code above does not have this guarantee.

    To get the behaviour I've described, you need to use std::memory_order_acquire for the load (see the sketch below). If you're on x86, acquire doesn't produce any memory fence instructions, so the only way it will differ in performance from memory_order_relaxed is by preventing some compiler optimizations.
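
    For illustration, here's a minimal sketch of that publication pattern, assuming a single writer thread; the Payload struct and the function names are placeholders, not part of your code:

    #include <atomic>

    struct Payload { int value; };   // hypothetical non-atomic data being published

    Payload data;                    // written by the writer before the release store
    std::atomic<int> counter{0};

    // writer thread (single writer assumed, so a separate load + store is fine)
    void publish()
    {
        data.value = 42;                                   // initialize the payload first
        counter.store(counter.load(std::memory_order_relaxed) + 1,
                      std::memory_order_release);          // release: publishes the payload
    }

    // reader thread
    bool try_consume(int& out)
    {
        if (counter.load(std::memory_order_acquire) == 0)  // acquire: synchronizes with the release store
            return false;
        out = data.value;                                   // safe: the writer's initialization is visible here
        return true;
    }

    Whether the acquire load is actually measurably slower than a relaxed one on your target is exactly the kind of thing worth profiling before optimizing.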