Confused about Rust memory ordering

I have some questions about memory barriers in Rust. Starting from this example, I made a few changes:

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

struct UsizePair {
    atom: AtomicUsize,
    norm: UnsafeCell<usize>,
}

// UnsafeCell is not thread-safe. So manually mark our UsizePair to be Sync.
// (Effectively telling the compiler "I'll take care of it!")
unsafe impl Sync for UsizePair {}

static NTHREADS: usize = 8;
static NITERS: usize = 1000000;

fn main() {
    let upair = Arc::new(UsizePair::new(0));

    // Barrier is a counter-like synchronization structure (not to be confused
    // with a memory barrier). It blocks on a `wait` call until a fixed number
    // of `wait` calls are made from various threads (like waiting for all
    // players to get to the starting line before firing the starter pistol).
    let barrier = Arc::new(Barrier::new(NTHREADS + 1));

    let mut children = vec![];

    for _ in 0..NTHREADS {
        let upair = upair.clone();
        let barrier = barrier.clone();
        children.push(thread::spawn(move || {
            barrier.wait();

            let mut v = 0;
            while v < NITERS - 1 {
                // Read both members `atom` and `norm`, and check whether `atom`
                // contains a newer value than `norm`. See `UsizePair` impl for
                // details.
                let (atom, norm) = upair.get();
                if atom != norm {
                    // If `Acquire`-`Release` ordering is used in `get` and
                    // `set`, then this statement will never be reached.
                    println!("Reordered! {} != {}", atom, norm);
                }
                v = atom;
            }
        }));
    }

    barrier.wait();

    for v in 1..NITERS {
        // Update both members `atom` and `norm` to value `v`. See the impl for
        // details.
        upair.set(v);
    }

    for child in children {
        let _ = child.join();
    }
}

impl UsizePair {
    pub fn new(v: usize) -> UsizePair {
        UsizePair {
            atom: AtomicUsize::new(v),
            norm: UnsafeCell::new(v),
        }
    }

    pub fn get(&self) -> (usize, usize) {
        let atom = self.atom.load(Ordering::Acquire); //Ordering::Acquire

        // If the above load operation is performed with `Acquire` ordering,
        // then all writes before the corresponding `Release` store are
        // guaranteed to be visible below.

        let norm = unsafe { *self.norm.get() };
        (atom, norm)
    }

    pub fn set(&self, v: usize) {
        unsafe { *self.norm.get() = v };

        // If the below store operation is performed with `Release` ordering,
        // then the write to `norm` above is guaranteed to be visible to all
        // threads that "loads `atom` with `Acquire` ordering and sees the same
        // value that was stored below". However, no guarantees are provided as
        // to when other readers will witness the below store, and consequently
        // the above write. On the other hand, there is also no guarantee that
        // these two values will be in sync for readers. Even if another thread
        // sees the same value that was stored below, it may actually see a
        // "later" value in `norm` than what was written above. That is, there
        // is no restriction on visibility into the future.

        self.atom.store(v, Ordering::Release); //Ordering::Release
    }
}

Basically, I just changed the check to if atom != norm and changed the memory orderings used in the get and set methods.

As I understand it, all memory operations that happen before a Release store (1. regardless of whether they operate on the same memory location, and 2. regardless of whether they are atomic or ordinary memory operations) will be visible to memory operations after the Acquire load.

So I don't understand how atom != norm can ever be true. The comments in the example do point out that:

However, no guarantees are provided as to when other readers will witness the below store, and consequently the above write. On the other hand, there is also no guarantee that these two values will be in sync for readers. Even if another thread sees the same value that was stored below, it may actually see a "later" value in norm than what was written above. That is, there is no restriction on visibility into the future.

Can someone explain to me why norm can see some "future value"?

Also, in this C++ example, is the same effect the reason for this statement in the code?

v0, v1, v2 might turn out to be -1, some, or all of them.

Solution

all the memory operations ... happens before a store Release, will be visible to the memory operation after a load Acquire.

That's true only if the acquire load sees the value from the release store.

If not, the acquire load ran before the release store was globally visible, so there are no guarantees about anything; you didn't actually synchronize with that writer. The load of norm happens after the acquire load, so another store might have become globally visible¹ during that interval.
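To make that condition concrete, here is a minimal sketch of the case where synchronization does happen: the reader keeps loading until its acquire load actually observes the value written by the release store, and only then reads the payload. The function name `publish_and_read` and the flag/payload pair are illustrative and not part of the question's code, and the payload is a relaxed atomic rather than an `UnsafeCell` so that the sketch itself has no data race.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical helper (not from the question): publish one value via a flag.
fn publish_and_read() -> usize {
    // .0 is the payload, .1 is the "ready" flag.
    // The payload is a relaxed atomic instead of an UnsafeCell, so this
    // sketch stays free of data races.
    let shared = Arc::new((AtomicUsize::new(0), AtomicBool::new(false)));

    let writer = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            shared.0.store(42, Ordering::Relaxed); // payload write
            shared.1.store(true, Ordering::Release); // publish it
        })
    };

    // Retry until the acquire load observes the release store.
    while !shared.1.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // We synchronized with the writer, so the payload write is visible here.
    let value = shared.0.load(Ordering::Relaxed);

    writer.join().unwrap();
    value
}

fn main() {
    println!("payload = {}", publish_and_read());
}
```

The reader in the question never waits like this: it takes whatever `atom` value it happens to load, so the only guarantee it gets covers writes made before the particular store it observed.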

Also, the norm store is done first² so even if atom and norm were loaded simultaneously (e.g. by one wide atomic load), it would still be possible to see norm updated but atom not yet.

Footnote 1: (Or visible to this thread, on the rare machine where that can happen without being globally visible, e.g. PowerPC)

Footnote 2: The only actual guarantee is not-later; they could both become globally visible as one wider transaction, e.g. the compiler would be allowed to merge the norm store and the atom store into one wider atomic store, or hardware could do that via store coalescing in the store buffer. So there might never be a time interval when you could observe norm updated but atom not; it depends on the implementation (hardware and compiler).
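The one-directional guarantee can be checked with a small sketch. It assumes a variant of the question's `UsizePair` in which `norm` is a relaxed `AtomicUsize` instead of an `UnsafeCell`, so the experiment has no data race; the names `Pair` and `count_mismatches` are made up for illustration. The reader counts how often `norm` is older than `atom` (which acquire/release rules out) and how often it is newer (which is allowed).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical variant of `UsizePair`: both fields are atomics.
struct Pair {
    atom: AtomicUsize,
    norm: AtomicUsize,
}

const NITERS: usize = 100_000;

// Returns (times norm was older than atom, times norm was newer than atom).
fn count_mismatches() -> (usize, usize) {
    let pair = Arc::new(Pair {
        atom: AtomicUsize::new(0),
        norm: AtomicUsize::new(0),
    });

    let reader = {
        let pair = Arc::clone(&pair);
        thread::spawn(move || {
            let (mut stale, mut ahead) = (0usize, 0usize);
            let mut v = 0;
            while v < NITERS - 1 {
                let atom = pair.atom.load(Ordering::Acquire);
                // The writer may have run further between these two loads.
                let norm = pair.norm.load(Ordering::Relaxed);
                if norm < atom {
                    stale += 1; // would contradict acquire/release
                }
                if norm > atom {
                    ahead += 1; // permitted "future" value
                }
                v = atom;
            }
            (stale, ahead)
        })
    };

    // Same order as `set` in the question: norm first, then atom with Release.
    for v in 1..NITERS {
        pair.norm.store(v, Ordering::Relaxed);
        pair.atom.store(v, Ordering::Release);
    }

    reader.join().unwrap()
}

fn main() {
    let (stale, ahead) = count_mismatches();
    println!("norm older than atom: {}, norm newer than atom: {}", stale, ahead);
}
```

The first count must be zero under the memory model. The second depends on timing and hardware, so it may be zero on some runs and large on others.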

(IDK what kind of guarantees Rust gives here or how it formally defines synchronization and memory order. But the basics of acquire and release synchronization are fairly universal; see https://preshing.com/20120913/acquire-and-release-semantics/. In C++, reading a non-atomic norm at all without achieving synchronization would be data-race UB (undefined behaviour), but of course when compiled for real hardware the effects I describe are what would happen in practice, whether the source language is C++ or Rust.)