Tags: c++, concurrency, atomic, memory-barriers

C++ atomic increment with memory ordering


After reading Chapter 5 of C++ Concurrency in Action, I tried to write some code to test my understanding of memory ordering:

#include <iostream>
#include <vector>
#include <thread>
#include <atomic>

std::atomic<int> one,two,three,sync;

void func(int i){
    while(i != sync.load(std::memory_order_acquire));
    auto on = one.load(std::memory_order_relaxed); ++on;
    auto tw = two.load(std::memory_order_relaxed); ++tw;
    auto th = three.load(std::memory_order_relaxed); ++th;
    std::cout << on << tw << th << std::endl;
    one.store(on,std::memory_order_relaxed);
    two.store(tw,std::memory_order_relaxed);
    three.store(th,std::memory_order_relaxed);
    int expected = i;
    while(!sync.compare_exchange_strong(expected,i+1,
            std::memory_order_acq_rel))
        expected = i;
}

int main(){
    std::vector<std::thread> t_vec;
    for(auto i = 0; i != 5; ++i)
        t_vec.push_back(std::thread(func,i));
    for(auto i = 0; i != 5; ++i)
        t_vec[i].join();
    std::cout << one << std::endl;
    std::cout << two << std::endl;
    std::cout << three << std::endl;
    return 0;
}

My question is: the book says that memory_order_release and memory_order_acquire must form a pair in order for the right value to be read.

So if the first line of func() loads sync in a loop with memory_order_acquire, that should break the pairing and cause an unpredictable synchronization error.

However, after being compiled and run on my x86 platform, it prints the expected output:

111
222
333
444
555
5
5
5

The result shows no problem. So I wonder what is actually happening inside func() (even though I wrote it myself)?

Added: according to the code in C++ Concurrency in Action, page 141:

#include <atomic>
#include <thread>
#include <vector>

std::vector<int> queue_data;
std::atomic<int> count;

void populate_queue(){
    unsigned const number_of_items = 20;
    queue_data.clear();
    for(unsigned i = 0; i < number_of_items; ++i)
        queue_data.push_back(i);
    count.store(number_of_items, std::memory_order_release);
}

void consume_queue_items(){
    while(true){
        int item_index;
        if((item_index=count.fetch_sub(1,std::memory_order_acquire))<=0){
            wait_for_more_items();  // defined elsewhere in the book's listing
            continue;
        }
        process(queue_data[item_index-1]);  // defined elsewhere in the book's listing
    }
}

int main(){
    std::thread a(populate_queue);
    std::thread b(consume_queue_items);
    std::thread c(consume_queue_items);
    a.join();
    b.join();
    c.join();
}

Thread b and thread c will work just fine, no matter which one accesses count first, because:

Thankfully, the first fetch_sub() does participate in the release sequence, and so the store() synchronizes-with the second fetch_sub(). There's still no synchronizes-with relationship between the two consumer threads. There can be any number of links in the chain, but provided they're all read-modify-write operations such as fetch_sub(), the store() will still synchronize-with each one that's tagged memory_order_acquire. In this example, all the links are the same, and all are acquire operations, but they could be a mix of different operations with different memory-ordering semantics.

But I can't find more information about this. How does a read-modify-write operation such as fetch_sub() participate in the release sequence? And if I change it to a load with memory_order_acquire, will store() still synchronize-with the load() in each independent thread?


Solution

  • Your code implements a basic spinlock mutex that lets each thread take the lock implicitly by recognizing its own value, rather than by changing any shared state.

    Memory ordering is correct, and even stronger than it technically needs to be. The compare_exchange_strong at the bottom is not necessary: each thread is the only one that ever advances sync from i to i+1, so a plain store with release semantics would suffice:

    sync.store(i+1, std::memory_order_release);
    

    Reordering of the relaxed operations is possible but does not change the output of your program. There is no undefined behavior, and the same output is guaranteed on all platforms.
    In fact, one, two and three don't even have to be atomic, because they are only accessed inside your spinlock mutex or after all threads have joined.

    So if the first line of func() is load sync within a loop with memory_order_acquire, it should break the pair and make an unpredictable error on synchronization.

    Acquire/release pairing is correct: the release barrier at the bottom (in thread X) pairs with the acquire barrier at the top (in thread Y). That the first thread acquires with no previous release is fine, since there is nothing to release yet.

    About the 'Added' part:

    How read-modify-write operation such as fetch_sub() participate in the release sequence?

    This is what the standard says in 1.10.1-5:

    A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation:

    • is performed by the same thread that performed A, or
    • is an atomic read-modify-write operation.

    So, in order to release data to another thread, a load/acquire operation needs to observe either the value that was stored by the release operation or a later value, as long as every intermediate modification satisfies one of these requirements.
    In effect, read-modify-write operations have the additional property that they always act on the latest value in the modification order, which is what keeps them inside the release sequence.

    If I change it to load with memory_order_acquire, will store() still synchronizes-with load() in each independent thread?

    If you change the read-modify-write to a separate load/acquire (that sees the updated value) followed by a store/release, the synchronization is still correct, but the store is no longer part of the same release sequence; it heads a separate release sequence of its own.