c++, multithreading, concurrency, mutex

Holding a mutex lock after all writes are finished


I have a question about thread safety and mutex locks in C++. I know that at the simplest level, a mutex lock should be held at any point where there could be simultaneous reading and writing of the same memory. However, I also know that each core has its own cache, so I picture the mutex lock as a way to ensure that all threads "agree" on the state of the data at any point in time.

The thing I am unsure about is what happens when all the writing operations are done. Is it safe to use the data then without owning the mutex lock?

See the example below. Based on my shallow understanding, it feels like the last line could produce undefined behavior, since the main thread is not aware of all the writes that were done in the worker threads. That is, what if arr in the main thread's cache looks different from arr in the worker threads' caches?

#include <iostream>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;
std::vector<int> arr;

int main() {
   std::list<std::thread> threads;
   for (int i = 0; i < 10; ++i) {
      threads.emplace_back([&, i]() {
         std::lock_guard<std::mutex> lock(mtx);
         arr.push_back(i); // Change arr here.
      });
   }
   for (auto& thread : threads) {
      thread.join();
   }
   // Can I use arr here without owning the mutex lock?
   for (int value : arr) {
      std::cout << value << std::endl;
   }
}

So my question is whether the above example is safe, and if not, whether I can ever use arr without holding a mutex lock.


Solution

  • Yes, this code is safe.

    The C++ standard's description of std::thread::join() includes:

    The completion of the thread represented by *this synchronizes with (6.9.2) the corresponding successful join() return.

    That implies that if some operation X happens before (in the technical sense of the C++ memory model) the completion of the thread, and some operation Y happens after the return of join(), then X happens before Y. In your program, this means that the operations on arr within the various threads all happen before the access in the main thread that follows the join() calls. Thus there is no data race.
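
    To make that concrete, here is a minimal sketch (mine, not part of the question) in which no mutex is used at all: each worker writes a distinct element, so there is no concurrent access while the threads run, and the join() calls alone make every write visible to the reads that follow.

    #include <iostream>
    #include <list>
    #include <thread>
    #include <vector>

    int main() {
       std::vector<int> arr(10);  // pre-sized so the workers touch disjoint elements
       std::list<std::thread> threads;
       for (int i = 0; i < 10; ++i) {
          // Each worker writes only arr[i]; no two threads touch the same element.
          threads.emplace_back([&arr, i]() { arr[i] = i * i; });
       }
       for (auto& thread : threads) {
          thread.join();  // the worker's completion happens before this call returns
       }
       // Safe: every write above happens before these reads, via join().
       for (int value : arr) {
          std::cout << value << '\n';
       }
    }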

    For the implementation, this means in effect that the cleanup code that terminates a thread must contain a release barrier before it notifies the thread waiting on join(), and the code for join() must have an acquire barrier after receiving that notification. In practice, the necessary barriers are likely to be automatically provided by the OS services that exit and wait for threads.
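
    As a rough illustration of that release/acquire pairing (a sketch of the general mechanism, not the actual library or OS code), the same effect can be reproduced by hand with an atomic flag: the plain write is published by a release store and picked up by an acquire load, which is what makes it visible to the other thread without a mutex.

    #include <atomic>
    #include <iostream>
    #include <thread>

    int shared_value = 0;           // plain, non-atomic data
    std::atomic<bool> done{false};  // stands in for the "thread finished" notification

    int main() {
       std::thread worker([]() {
          shared_value = 42;                            // ordinary write
          done.store(true, std::memory_order_release);  // "release barrier" before notifying
       });
       while (!done.load(std::memory_order_acquire)) {  // "acquire barrier" after the notification
          // spin until the worker has published its result
       }
       std::cout << shared_value << '\n';               // guaranteed to print 42; no data race
       worker.join();
    }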