c++ multithreading c++11 boost-asio memory-barriers

Which Memory Order Should I use for a Host Thread waiting on Worker Threads?

I've got code that dispatches tasks to an asio io_service object to be processed by a pool of worker threads. As far as I can tell, the code behaves correctly, but I don't know much about memory ordering, and I'm not sure which memory orders I should use when storing and loading the atomic flags to get the best performance.

//boost::asio::io_service;
//^^ Declared outside this scope
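//std::atomic_bool is not copyable, so the (count, value) constructor is ill-formed;
//value-initialize the elements instead, which leaves every flag false
std::vector<std::atomic_bool> flags(num_of_threads);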
//std::vector<std::thread> threads(num_of_threads);
//^^ Declared outside this scope, all of them simply call the run() method on io_service

for(int i = 0; i < num_of_threads; i++) {
    io_service.post([&, i]{
        /*...*/
        flags[i].store(true, /*[[[1]]]*/);
    });
}

for(std::atomic_bool & atm_bool : flags) while(!atm_bool.load(/*[[[2]]]*/)) std::this_thread::yield();

So basically, what I want to know is, what should I substitute in for [[[1]]] and [[[2]]]?

If it helps, the code is functionally similar to the following:

std::vector<std::thread> threads;
for(int i = 0; i < num_of_threads; i++) threads.emplace_back([]{/*...*/});
for(std::thread & thread : threads) thread.join();

Except that my code keeps the threads alive in an external thread pool and dispatches tasks to them.

Solution

You want to establish a happens-before relation between the worker thread that sets the flag and the host thread that sees it set. Once the host thread observes the flag as true, it is then guaranteed to also see the effects of everything the worker did before setting it. With a weaker ordering such as std::memory_order_relaxed, there is no such guarantee.

This can be done using release-acquire semantics:

flags[i].store(true, std::memory_order_release);
// ...
while (!atm_bool.load(std::memory_order_acquire)) ...
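
To make the pairing concrete, here is a minimal self-contained sketch that uses plain std::thread workers instead of io_service. The function name run_and_wait and the results vector are hypothetical and exist only for illustration. Each worker writes a non-atomic result and then publishes it with a release store; the host reads the result only after its acquire load has seen the flag.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper: starts num_workers threads, waits on their flags,
// and returns the sum of the values they produced.
int run_and_wait(std::size_t num_workers) {
    std::vector<int> results(num_workers, 0);  // plain, non-atomic data

    // std::atomic is not copyable, so the flags live in a heap array
    // and are cleared one by one before any worker starts.
    std::unique_ptr<std::atomic<bool>[]> flags(new std::atomic<bool>[num_workers]);
    for (std::size_t i = 0; i < num_workers; ++i)
        flags[i].store(false, std::memory_order_relaxed);

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_workers; ++i) {
        workers.emplace_back([&results, &flags, i] {
            results[i] = static_cast<int>(i) + 1;             // the "work"
            flags[i].store(true, std::memory_order_release);  // publish it
        });
    }

    int sum = 0;
    for (std::size_t i = 0; i < num_workers; ++i) {
        // The acquire load pairs with the worker's release store, so the
        // write to results[i] is visible once the flag reads true.
        while (!flags[i].load(std::memory_order_acquire))
            std::this_thread::yield();
        sum += results[i];
    }

    for (std::thread& t : workers) t.join();
    return sum;
}
```

Calling run_and_wait(4) from main should return 10. The join at the end is only there to clean up the threads; the flags alone are what make reading results[i] safe.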

Note that in this case it might be cleaner to block on an OS-level semaphore than to spin-wait on an array of flags. Failing that, it would still be slightly more efficient to spin on a single count of completed tasks than to check each flag in the array in turn.
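
As a rough sketch of the counter variant, again with plain std::thread workers standing in for the io_service tasks (run_and_wait_counted and results are hypothetical names): each worker does a release fetch_add on a shared counter, and the host spins with acquire loads until the counter reaches the number of tasks. Because every increment is a read-modify-write, the final acquire load synchronizes with all of the workers' release operations, not just the last one.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: same idea as per-task flags, but with a single
// shared counter of completed tasks.
int run_and_wait_counted(std::size_t num_workers) {
    std::vector<int> results(num_workers, 0);  // plain, non-atomic data
    std::atomic<std::size_t> completed(0);

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < num_workers; ++i) {
        workers.emplace_back([&results, &completed, i] {
            results[i] = static_cast<int>(i) * 2;                // the "work"
            completed.fetch_add(1, std::memory_order_release);  // publish it
        });
    }

    // One location to poll instead of one flag per task.
    while (completed.load(std::memory_order_acquire) != num_workers)
        std::this_thread::yield();

    int sum = 0;
    for (int r : results) sum += r;  // every worker's write is visible here

    for (std::thread& t : workers) t.join();
    return sum;
}
```

With four workers this should return 0 + 2 + 4 + 6 = 12. The trade-off is that all workers now contend on one cache line, which is usually negligible when each task does a meaningful amount of work.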