Tags: c++, multithreading, c++11, atomic

Confusing Memory Reordering Behavior


I am trying to run a simple task (obtaining the x2APIC ID of the current processor) on every available hardware thread. I wrote the following code to do this, which works on the machines that I tested it on (see here for the complete MWE, compilable on Linux as C++11 code).

void print_x2apic_id()
{
        uint32_t r1, r2, r3, r4;
        // CPUID leaf 0x0B: the x2APIC ID of the current core is returned in EDX (r4).
        std::tie(r1, r2, r3, r4) = cpuid(11, 0);
        std::cout << r4 << std::endl;
}

int main()
{
        // _ is an alias for std::ignore, used in the variants shown further below.
        const auto _ = std::ignore;
        auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
        auto set = ::cpu_set_t{};
        std::cout << "Processors online: " << nprocs << std::endl;

        for (auto i = 0; i != nprocs; ++i) {
                // Pin the calling thread to CPU i, then query that core's x2APIC ID.
                CPU_SET(i, &set);
                check(::sched_setaffinity(0, sizeof(::cpu_set_t), &set));
                CPU_CLR(i, &set);
                print_x2apic_id();
        }
}
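The cpuid helper itself lives in the MWE; it is a thin wrapper around the CPUID instruction, something like the sketch below (an assumed shape for illustration, not the original source; note that the asm statement is not marked volatile):

#include <cstdint>
#include <tuple>

// Assumed shape of the cpuid wrapper from the MWE: run CPUID with the given
// leaf in EAX and subleaf in ECX, and return EAX, EBX, ECX, EDX as a tuple.
std::tuple<uint32_t, uint32_t, uint32_t, uint32_t>
cpuid(uint32_t leaf, uint32_t subleaf)
{
        uint32_t a, b, c, d;
        asm("cpuid"
            : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
            : "a"(leaf), "c"(subleaf));
        return std::make_tuple(a, b, c, d);
}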

Output on one machine (when compiled with g++, version 4.9.0):

0
2
4
6
32
34
36
38

Each iteration printed a different x2APIC ID, so things are working as expected. This is where the problems start. I replaced the call to print_x2apic_id with the following code:

uint32_t r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

This causes the same ID to be printed for each iteration of the loop:

36
36
36
36
36
36
36
36

My guess as to what happened is that the compiler noticed that the call to cpuid appears not to depend on the loop iteration (even though its result really does, since the thread is pinned to a different core each time). The compiler then "optimized" the code by hoisting the call to CPUID out of the loop. To try to fix this, I converted r4 into an atomic:

std::atomic<uint32_t> r4;
std::tie(_, _, _, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

This failed to fix the problem. Surprisingly, the following variant does:

std::atomic<uint32_t> r1;
uint32_t r2, r3, r4;
std::tie(r1, r2, r3, r4) = cpuid(11, 0);
std::cout << r4 << std::endl;

... Ok, now I'm confused.

Edit: Replacing the asm statement in the cpuid function with asm volatile also fixes the issue, but I don't see why this should be necessary.
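Concretely, that change amounts to something like the following (assuming a GCC-style wrapper of the shape sketched above):

        // "volatile" tells the compiler this asm has effects beyond producing
        // its outputs (it reads per-core state), so the statement may not be
        // hoisted out of the loop or merged with another instance.
        asm volatile("cpuid"
                     : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                     : "a"(leaf), "c"(subleaf));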

My Questions

  1. Shouldn't inserting an acquire fence before the call to cpuid and a release fence after it be sufficient to prevent the compiler from performing this reordering?
  2. Why didn't converting r4 to std::atomic<uint32_t> work? And why did storing the first three outputs into r1, r2, and r3 instead of ignoring them cause the program to work?
  3. How can I correctly write the loop, using the least amount of synchronization necessary?

Solution

  • I've reproduced the problem with optimization enabled (-O). You are right to suspect a compiler optimization. The CPUID instruction itself acts as a full memory barrier for the processor, but it is the compiler that generates code which does not call the cpuid function inside the loop, because it treats it as a constant (pure) function. asm volatile prevents the compiler from applying this optimization by declaring that the statement has side effects.

    See this answer for details: https://stackoverflow.com/a/14449998/2527797
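    For completeness, here is a minimal sketch of a corrected program, assuming the wrapper shape sketched earlier (error checking omitted; the variable names are illustrative, not the OP's exact MWE):

#include <cstdint>
#include <iostream>
#include <tuple>
#include <sched.h>
#include <unistd.h>

// CPUID wrapper. "volatile" marks the asm as having side effects, so the
// compiler re-executes it on every call instead of hoisting it out of loops.
std::tuple<uint32_t, uint32_t, uint32_t, uint32_t>
cpuid(uint32_t leaf, uint32_t subleaf)
{
        uint32_t a, b, c, d;
        asm volatile("cpuid"
                     : "=a"(a), "=b"(b), "=c"(c), "=d"(d)
                     : "a"(leaf), "c"(subleaf));
        return std::make_tuple(a, b, c, d);
}

int main()
{
        auto nprocs = ::sysconf(_SC_NPROCESSORS_ONLN);
        auto set = ::cpu_set_t{};

        for (auto i = 0; i != nprocs; ++i) {
                // Pin the calling thread to CPU i before querying its x2APIC ID.
                CPU_SET(i, &set);
                ::sched_setaffinity(0, sizeof(::cpu_set_t), &set);
                CPU_CLR(i, &set);

                // Leaf 0x0B: EDX holds the x2APIC ID of the current core.
                uint32_t x2apic_id;
                std::tie(std::ignore, std::ignore, std::ignore, x2apic_id) =
                        cpuid(11, 0);
                std::cout << x2apic_id << std::endl;
        }
}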