I have this code in test.cpp:
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

static const int N_ITEMS = 11;
static const int N_WORKERS = 4;

int main(void)
{
    int* const items = (int*)std::malloc(N_ITEMS * sizeof(*items));
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }

    std::thread* const workers = (std::thread*)std::malloc(N_WORKERS * sizeof(*workers));
    std::atomic<int> place(0);
    for (int w = 0; w < N_WORKERS; ++w) {
        new (&workers[w]) std::thread([items, &place]() {
            int i;
            while ((i = place.fetch_add(1, std::memory_order_relaxed)) < N_ITEMS) {
                items[i] *= items[i];
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        });
    }
    for (int w = 0; w < N_WORKERS; ++w) {
        workers[w].join();
        workers[w].~thread();
    }
    std::free(workers);

    for (int i = 0; i < N_ITEMS; ++i) {
        std::cout << items[i] << '\n';
    }
    std::free(items);
}
I compile like so on Linux:
c++ -std=c++11 -Wall -Wextra -pedantic test.cpp -pthread
When run, the program should print this:
0
1
4
9
16
25
36
49
64
81
100
I don't have much experience with C++, nor do I really understand atomic operations. According to the standard, will the worker threads always square the item values correctly, and will the main thread always print the correct final values? I'm worried that the atomic variable might be updated out of sync with the item values, or something like that. If so, can I fix the code by changing the memory order used with fetch_add?
Looks safe to me. Your i = place.fetch_add(1, std::memory_order_relaxed) hands out each array index exactly once, each index to exactly one of the threads. So any given array element is only ever touched by a single thread, and that's guaranteed to be safe for all types other than bit-fields of a struct¹.
Footnote 1: Or elements of std::vector<bool>, which the standard unfortunately requires to be a packed bit-vector, breaking some of the usual guarantees for std::vector.
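As an aside, a minimal sketch of what that footnote means in practice (illustrative only, not part of your program): two threads may write different elements of an ordinary vector concurrently, but not different elements of std::vector<bool>, because those can live in the same byte.

#include <thread>
#include <vector>

int main()
{
    std::vector<char> bytes(2, 0);            // distinct objects: safe to write
    std::thread a([&]{ bytes[0] = 1; });      // element 0 from one thread and
    std::thread b([&]{ bytes[1] = 1; });      // element 1 from another, no data race
    a.join();
    b.join();

    std::vector<bool> bits(2, false);         // packed: bits[0] and bits[1] can share a byte,
    bits[0] = true;                           // so writing them from two different threads
    bits[1] = true;                           // (unlike here, in one thread) would be a data race
}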
There's no need for any ordering of these accesses while the worker threads are working; the main thread join()s the workers before reading the array, so everything done by the workers "happens before" (in ISO C++ standardese) the main thread's std::cout << items[i] accesses.
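A minimal standalone sketch of that join() guarantee (illustrative only): the worker's plain, non-atomic write is guaranteed to be visible in the main thread once join() has returned.

#include <iostream>
#include <thread>

int main()
{
    int result = 0;
    std::thread t([&result]() { result = 7; });   // plain write in the worker thread
    t.join();                                      // everything the worker did happens-before
    std::cout << result << '\n';                   // this read, so it always prints 7
}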
Of course, the array elements are all written by the main thread before the worker threads are started, but that's also safe because the std::thread constructor makes sure everything earlier in the parent thread happens-before anything in the new thread:

"The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution."
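The same rule as a minimal standalone sketch (illustrative only): the plain write in the parent thread is sequenced before the constructor call, so the new thread is guaranteed to see it without any atomics.

#include <iostream>
#include <thread>

int main()
{
    int data = 0;
    data = 42;                          // plain write, sequenced before the constructor call
    std::thread t([&data]() {
        std::cout << data << '\n';      // constructor completion synchronizes-with the lambda's
    });                                 // start, so this always prints 42
    t.join();
}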
There's also no need for any order stronger than memory_order_relaxed on the increment: it's the only atomic variable in your program, and you don't need any ordering between your operations beyond the overall thread creation and join.

It's still atomic, so it's guaranteed that, say, 100 increments will produce the numbers 0..99 exactly once each; there's simply no guarantee about which thread gets which. (There is a guarantee that each thread will see monotonically increasing values: for every atomic object separately, a single modification order exists, and that order is consistent with some interleaving of each thread's program-order modifications to it.)
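If you want to convince yourself of that, here's a sketch of a standalone test (the sizes and names are mine, not from your program) checking that every index is handed out exactly once even with memory_order_relaxed:

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main()
{
    const int kItems = 1000;              // arbitrary test sizes
    const int kWorkers = 4;
    std::atomic<int> place(0);
    std::vector<int> claimed(kItems, 0);  // each slot written only by the thread that claimed it

    std::vector<std::thread> workers;
    for (int w = 0; w < kWorkers; ++w) {
        workers.emplace_back([&]() {
            int i;
            while ((i = place.fetch_add(1, std::memory_order_relaxed)) < kItems) {
                ++claimed[i];             // exclusive access: no other thread got this i
            }
        });
    }
    for (auto& t : workers) {
        t.join();                         // join() makes the workers' writes visible below
    }

    for (int i = 0; i < kItems; ++i) {
        assert(claimed[i] == 1);          // every index claimed exactly once, none twice or skipped
    }
}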
Just for the record, this is hilariously inefficient compared to having each worker square a contiguous range of indices. That would take only one atomic access per thread, or none at all if the main thread simply passed each worker its range.

It would also avoid the false sharing you get from four threads loading and storing into the same cache line at the same time as they move through the array, and contiguous ranges would let the compiler auto-vectorize with SIMD to load/square/store multiple elements at once.
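For completeness, here's one way the contiguous-range version could look (a sketch with my own naming and splitting choices, not the only way to do it): each worker gets a half-open slice of the array, and no atomics are needed at all.

#include <iostream>
#include <thread>
#include <vector>

static const int N_ITEMS = 11;
static const int N_WORKERS = 4;

int main()
{
    std::vector<int> items(N_ITEMS);
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }

    std::vector<std::thread> workers;
    for (int w = 0; w < N_WORKERS; ++w) {
        // Non-overlapping [begin, end) slices that together cover the whole array.
        const int begin = (N_ITEMS * w) / N_WORKERS;
        const int end   = (N_ITEMS * (w + 1)) / N_WORKERS;
        workers.emplace_back([&items, begin, end]() {
            for (int i = begin; i < end; ++i) {
                items[i] *= items[i];     // contiguous, SIMD-friendly, no index sharing between threads
            }
        });
    }
    for (auto& t : workers) {
        t.join();                         // workers' writes happen-before the reads below
    }

    for (int i = 0; i < N_ITEMS; ++i) {
        std::cout << items[i] << '\n';
    }
}

Adjacent workers can still touch the same cache line once, right at a slice boundary, but that's a one-off cost rather than the continuous false sharing of interleaved indices.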