x86_64 memory barrier on single core

On x86_64, the Intel documentation (vol. 3A, section 8.2.3.2) says:

The Intel-64 memory-ordering model allows neither loads nor stores to be reordered with the same kind of operation. That is, it ensures that loads are seen in program order and that stores are seen in program order.

I need to be sure that successive writes to a variable at a given memory address won't be reordered.

I want to avoid an atomic xchg because of its high cost. In my application, the other CPU reading that value knows how to deal with an incomplete state.

Some code:

cli();
compiler_unoptimization(); // asm volatile("":::"memory")
volatile uint *p = (volatile uint *)0x86648664; // address doesn't matter
*p = 1;
... // some code here
*p = 0;
sti();

So, am I right in assuming that:

1) the CPU won't perform *p = 0 before *p = 1, without the need for an sfence
2) the compiler (gcc or clang) won't reorder the writes to *p either, given the asm trick (which is needed here, right?).

Solution

While the C standard guarantees that accesses to volatile objects are issued in order, it gives no such guarantee relative to accesses to non-volatile objects.

Both of your accesses are volatile, so the compiler has to generate them in order, but anything in the ellipsis can be moved around freely unless it is volatile, too!

Also, as far as the C standard is concerned, volatile does not imply that the hardware will execute the accesses in order. That would be guaranteed by an appropriate barrier for the CPU, but, depending on the architecture and the barrier, it may not suffice for the rest of the hardware (caches, buses, memory system, etc.).

For x86, this ordering is guaranteed (that is not typical, though: many RISC architectures such as ARM and PPC are more relaxed and thus require more carefully written code). As you only refer to a single CPU and volatile has no side effect outside it here, the memory system is not relevant. So you are on the safe side here.

Things are much more complicated for memory-mapped peripherals and multiprocessors, i.e. if you have side effects beyond the single CPU. Simple example: the first write may not go past the CPU cache, so anything reading the same memory page may see only the second write or none at all. volatile is not enough here; you need atomic accesses and (possibly) barriers.

For your code, you can either make all variables in the ellipsis volatile (inefficient), or add compiler barriers around them (after *p = 1; and before *p = 0;). This way the compiler will not move memory accesses across the barrier.
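The barrier placement described above might look roughly like the following sketch. It assumes GCC or Clang (the empty asm with a "memory" clobber is a compiler extension, not standard C), and busy_flag, shared_data and update_shared are hypothetical names standing in for the fixed address and the elided code in the question.

```c
/* Compiler-only barrier (GCC/Clang syntax): emits no instruction, but the
   compiler may not move memory accesses across it. */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Hypothetical stand-ins: a flag and some ordinary (non-volatile) data. */
static volatile unsigned busy_flag;
static unsigned shared_data[4];

/* Sets the flag, updates the data, clears the flag; returns the last element. */
unsigned update_shared(unsigned value)
{
    volatile unsigned *p = &busy_flag; /* the question uses a fixed address */

    *p = 1;
    compiler_barrier(); /* stores below cannot be hoisted above *p = 1 */

    for (unsigned i = 0; i < 4; i++)
        shared_data[i] = value + i;    /* plain, non-volatile accesses */

    compiler_barrier(); /* stores above cannot be sunk below *p = 0 */
    *p = 0;

    return shared_data[3];
}
```

This only constrains the compiler. On x86 the hardware already keeps stores in program order, so no fence instruction is emitted or needed for this particular pattern; on a more relaxed architecture a real hardware barrier would be required as well.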

Finally: volatile does not guarantee atomic accesses. Thus, *p may not be written by a single instruction. (I would not emphasise this too much, as I assume uint is unsigned int, which is normally 32 bits on 32- or 64-bit x86 targets, but it will be an issue for 8- or 16-bit CPUs.) To be on the safe side, use _Atomic types (available since C11).

PS: Avoid home-grown types like uint. The standard type unsigned is not significantly more to type, and everyone instantly knows what you mean. If you need a specific width, use the stdint.h types. Here, you should even use _Bool/bool, as you seem to have just a single true/false flag.

Note that all those features are available for low-level code, too. _Atomic types in particular (see also stdatomic.h) are meant for such purposes and normally do not need any special libraries. Using them is often no more complicated than using the non-qualified types, provided the type can be stored atomically (there are also macros which signal whether a specific type is atomic anyway).
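As a rough illustration of the C11 route, here is a minimal sketch of the same set/modify/clear pattern using stdatomic.h. The names busy, payload, publish and flag_lock_free_level are hypothetical; the payload is made atomic as well so that a concurrent reader would not constitute a data race, and the reader side (which would need matching acquire ordering) is not shown.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool busy;          /* hypothetical "update in progress" flag */
static _Atomic unsigned payload;  /* hypothetical data guarded by the flag */

/* Sets the flag, updates the payload, clears the flag; returns the payload. */
unsigned publish(unsigned value)
{
    atomic_store_explicit(&busy, true, memory_order_relaxed);
    /* Release fence: the flag store above is ordered before the stores below. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&payload, value, memory_order_relaxed);

    /* Release store: the payload update is ordered before the flag clear. */
    atomic_store_explicit(&busy, false, memory_order_release);

    return atomic_load_explicit(&payload, memory_order_relaxed);
}

/* ATOMIC_BOOL_LOCK_FREE is 0 (never), 1 (sometimes) or 2 (always lock-free). */
int flag_lock_free_level(void)
{
    return ATOMIC_BOOL_LOCK_FREE;
}
```

On x86 targets, release fences and release stores like these typically compile to plain stores with no extra fence instruction, so this need not carry the cost of an xchg; that is an observation about common compilers, not something the C standard promises.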