I'm fairly new to programming with memory barriers/fences, and I was wondering how we can guarantee that setup writes are visible in worker functions subsequently run on other CPUs. For example, consider the following:
```
int setup, sheep;

void SetupSheep(): // Run once
    CPU 1: setup = 0;
    // ... much later ...
    CPU 1: sheep = 9;
    CPU 1: std::atomic_thread_fence(std::memory_order_release);
    CPU 1: setup = 1;
```
Run afterwards (not concurrently), many, many times:

```
void ManipulateSheep():
    CPU 2: int mySetup = setup;
    CPU 2: std::atomic_thread_fence(std::memory_order_acquire);
    CPU 2: // Use sheep...
```
On CPU 2, if `mySetup` is 1, then `sheep` is guaranteed to be 9 -- but how can we guarantee that `mySetup` is not 0?
So far, all I can think of is to spin-wait on CPU 2 until `setup` is 1. But this seems quite ugly, given that the spin-wait would only ever have to wait the first time `ManipulateSheep()` is called. Surely there must be a better way?
Note there's also a symmetrical problem with uninitialization code: Say you're writing a lock-free data structure which allocates memory during its lifetime. In the destructor (assuming all threads have finished calling methods), you want to deallocate all the memory, which means that you need the CPU that's running the destructor to have the latest variable values. It's not even possible to spin-wait in that scenario since the destructor would have no way of knowing what the "latest" state was in order to check for it.
Edit: I guess what I'm asking is: Is there a way to say "Wait for all my stores to propagate to other CPUs" (for initialization) and "Wait for all stores to propagate to my CPU" (for uninitialization)?
It turns out that `#StoreLoad` is exactly the right barrier for this situation. As explained simply by Jeff Preshing:

> A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier.
In C++11, `std::atomic_thread_fence(std::memory_order_seq_cst)` apparently acts as a `#StoreLoad` barrier (as well as the other three: `#StoreStore`, `#LoadLoad`, and `#LoadStore`). See this C++11 draft paper.
Side note: On x86, the `mfence` instruction acts as a `#StoreLoad` barrier; it can generally be emitted with the `_mm_mfence()` compiler intrinsic if need be.
So a pattern for lock-free code might be:
```
Initialize:
    CPU 1: setupStuff();
    CPU 1: std::atomic_thread_fence(std::memory_order_seq_cst);

Run parallel stuff

Uninitialize:
    CPU 2: std::atomic_thread_fence(std::memory_order_seq_cst);
    CPU 2: teardownStuff();
```