
Busy loop and the barrier


void loop(int loops)
{
    while (loops-- > 0)
        asm volatile ("" : : : "memory");   // empty asm: a pure compiler barrier
}
  1. I know that asm volatile ("" : : : "memory") prevents the compiler from reordering instructions across it. But here I cannot see what could be reordered, or why that would be problematic for concurrency (I am taking possible interrupts into account). So why is there a barrier?

  2. And second, a connected issue. Let's assume that we have a piece of code 10,000,000 lines long (see below). As we know, the CPU can reorder a store with a later load (StoreLoad reordering).

    mov [eax], $2; nop; nop; ...; nop; mov ebx, [ecx];

How far ahead is the CPU able to see that there is a chance to apply StoreLoad reordering?

The same question can be asked about the compiler, but there it concerns all kinds of reordering (not only StoreLoad, and not only memory operations), as in the sketch below.
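To make the compiler-reordering half concrete, here is a minimal sketch (an illustration added here, not part of the original question; the names publish, data and ready are made up). Without the barrier, the compiler is free to reorder the two plain stores or move them across surrounding computation:

    int data;
    int ready;

    void publish(void) {
        data = 42;
        asm volatile ("" : : : "memory");  /* compiler barrier: the store to
                                              data may not sink below it, and
                                              ready = 1 may not hoist above it */
        ready = 1;
    }

On x86 this compiler barrier alone keeps the two stores visible in order to other cores, since the hardware never reorders a store with an earlier store; a weakly-ordered architecture would additionally need a fence instruction.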


Solution

  • TL;DR: The problem here is that you're only thinking about it as std::atomic_thread_fence(std::memory_order_seq_cst), but that's not the only thing GNU C volatile asm statements do.


    Yes, obviously the barrier is there to make a nasty busy-wait delay loop. Remember that a volatile asm statement can't be optimized away, and can't be reordered with any other C statements, not just memory operations; that by itself forces the compiler to emit the loop instead of deleting the empty body.

    On the Godbolt compiler explorer:

    void loop_nomemclobber(int loops) {
      do {     // loop rearranged for simpler asm
        asm volatile ("" : : : /* "memory" */ );
      } while (--loops > 0);
    }
    
    loop_nomemclobber:
    .L3:
        sub     edi, 1
        test    edi, edi
        jg      .L3
        ret
    

    We still get a loop even without forcing all reachable memory to be up-to-date and treated as clobbered. So the reason the asm volatile statement forces the loop to exist has nothing to do with the "memory" clobber.
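    For contrast, a sketch of my own (not from the original answer): with no asm statement at all, the empty loop has no observable effect, so the compiler is free to delete it entirely under the as-if rule (the loop provably terminates, because signed overflow of loops would be undefined behaviour).

    void loop_noasm(int loops) {
      do {
        // empty body: nothing observable happens per iteration
      } while (--loops > 0);
    }

    loop_noasm:        # typical gcc -O2 output: the loop is gone
        ret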

    int loops is a local with automatic storage. The compiler can prove that nothing (including the asm statement) has any way to determine where it might be in memory, so it doesn't have to be in memory at all.
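    The flip side, as a sketch (my illustration, not from the original answer): hand the asm statement loops as a memory operand, and the compiler can no longer prove the asm has no way to see the variable, so loops does have to live in memory:

    void loop_memoperand(int loops) {
      do {
        // "+m" makes loops an input/output memory operand: the compiler must
        // give it a stack slot and assume the asm may read or modify it there
        asm volatile ("" : "+m" (loops));
      } while (--loops > 0);
    }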


    How far ahead is the CPU able to see that there is a chance to apply StoreLoad reordering?

    The CPU doesn't go looking for chances to reorder memory operations for no reason! Reordering happens naturally (unless prevented with MFENCE) because the CPU has to buffer stores until it's certain they aren't speculative, and because cache-miss stores take a while to commit. So it puts stores in the store buffer, and they eventually commit to cache, possibly after later loads have already taken their values.
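    A minimal sketch of the classic StoreLoad litmus test (my addition, not from the original answer; all the names are made up). Both threads use relaxed atomics, so the one reordering x86 does permit, StoreLoad, can let both threads read 0; run it many times in a loop to actually catch that outcome:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int x, y;   /* zero-initialized at file scope */
    int r1, r2;

    void *thread_a(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        /* atomic_thread_fence(memory_order_seq_cst);  // a full barrier (MFENCE
           on x86) here and in thread_b would make r1 == r2 == 0 impossible */
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *thread_b(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* r1 == 0 && r2 == 0 means each core's store was still in its store
           buffer when that core's own later load executed */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }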

    There isn't a little demon inside the CPU saying "aha, here's another chance to make things difficult for Gilgamesz, maybe I'll really trick him this time with this reordering!"


    There is a real question here, and that's how far apart two instructions need to be (in time, or in number of insns, or number of intervening loads/stores) before a specific microarchitecture doesn't have enough out-of-order resources for that store to ever be buffered until after that load.

    I don't know, but since StoreStore reordering isn't allowed, a cache-miss store to a highly-contended cache line can't sit there waiting to gain access to the line while millions of other instructions run, unless none of those instructions is a store.

    I don't know the answer, but I think it's plausible that a store could theoretically be delayed for millions of cycles on Intel Haswell, maybe bounded only by the fairness algorithms of the hardware arbitration mechanisms that handle the case where multiple cores contend for access to the same cache line.

    I forget what I've read about whether modern Intel hardware works this way or not, but I think maybe a store can retire from the out-of-order core but still not have committed to L1 cache. Instead, it's only in the store queue as a store that will definitely happen. That would let cache-miss stores avoid blocking new instructions from entering the ROB. (Loads need to probe the store buffer to preserve correct execution within the single core, but doing that doesn't require the stores to also be tracked by the ROB).
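    The single-core guarantee in that last parenthesis can be shown with a tiny sketch (mine, not from the original answer): a core always sees its own stores in program order, even while a store is still in the store buffer, because later loads probe the buffer and get the pending value forwarded (store-to-load forwarding).

    int observe_own_store(volatile int *p) {
        *p = 123;      // may still be sitting in this core's store buffer
        return *p;     // always 123: the load probes the store buffer and is
                       // forwarded the pending value, whether or not the store
                       // has committed to L1d cache yet
    }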