
Does standard C++11 guarantee that memory_order_seq_cst prevents StoreLoad reordering of non-atomic accesses around an atomic operation?


Does standard C++11 guarantee that memory_order_seq_cst prevents StoreLoad reordering around an atomic operation for non-atomic memory accesses?

As is known, C++11 has 6 std::memory_order values, which specify how regular, non-atomic memory accesses are to be ordered around an atomic operation - Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

§ 29.3 Order and consistency

§ 29.3 / 1

The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in 1.10 and may provide for operation ordering. Its enumerated values and their meanings are as follows:

It is also known that these 6 memory orders prevent some of the possible reorderings (LoadLoad, LoadStore, StoreStore, StoreLoad).

But does memory_order_seq_cst prevent StoreLoad reordering around an atomic operation for regular, non-atomic memory accesses, or only for other atomic operations with the same memory_order_seq_cst?

I.e. to prevent this StoreLoad reordering, should we use std::memory_order_seq_cst for both the STORE and the LOAD, or only for one of them?

std::atomic<int> a, b;
b.store(1, std::memory_order_seq_cst); // Sequential Consistency
a.load(std::memory_order_seq_cst); // Sequential Consistency

Acquire-release semantics are clear: they specify exactly how non-atomic memory accesses may be reordered across atomic operations: http://en.cppreference.com/w/cpp/atomic/memory_order
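To make the acquire-release case concrete, here is a minimal sketch of the usual message-passing pattern. The function name message_passing_demo and the variables payload, ready and seen are hypothetical and not taken from the question; the sketch only shows that a non-atomic store placed before a release store is visible after the matching acquire load.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (illustration only): returns the value the consumer
// read from the non-atomic payload.
int message_passing_demo() {
    int payload = 0;                 // regular, non-atomic data
    std::atomic<bool> ready{false};  // atomic flag that publishes the payload
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // non-atomic store
        ready.store(true, std::memory_order_release);  // earlier accesses cannot move below this store
    });
    std::thread consumer([&] {
        // Later accesses cannot move above this acquire load.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;  // happens after "payload = 42", so there is no data race
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Once the acquire load reads true, the release store synchronizes with it, so the consumer is guaranteed to read 42. Nothing in this pattern involves StoreLoad ordering, which is the case the question is about.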


To prevent StoreLoad-reordering we should use std::memory_order_seq_cst.

Two examples:

  1. std::memory_order_seq_cst for both STORE and LOAD: there is MFENCE

StoreLoad can't be reordered - GCC 6.1.0 x86_64: https://godbolt.org/g/mVZJs0

std::atomic<int> a, b;
b.store(1, std::memory_order_seq_cst); // can't be executed after LOAD
a.load(std::memory_order_seq_cst); // can't be executed before STORE

  2. std::memory_order_seq_cst for LOAD only: there isn't MFENCE

StoreLoad can be reordered - GCC 6.1.0 x86_64: https://godbolt.org/g/2NLy12

std::atomic<int> a, b;
b.store(1, std::memory_order_release); // can be executed after LOAD
a.load(std::memory_order_seq_cst); // can be executed before STORE

Also, a C/C++ compiler may use the alternative mapping of C/C++11 to x86, which flushes the store buffer before the LOAD: MFENCE, MOV (from memory). In that case we must use std::memory_order_seq_cst for the LOAD too: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html This example is discussed in another question as approach (3): "Does it make any sense instruction LFENCE in processors x86/x86_64?"

I.e. we should use std::memory_order_seq_cst for both the STORE and the LOAD to guarantee that an MFENCE is generated, which prevents StoreLoad reordering.
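The two cases above can be tried with a small store-buffering litmus harness. This is only a sketch: count_both_zero and its parameters are hypothetical names, and creating threads per iteration gives weak overlap, so a count of 0 for the release case proves nothing. What the C++ memory model does guarantee is that the count is 0 when both the store and the load are seq_cst.

```cpp
#include <atomic>
#include <thread>

// Hypothetical litmus harness (not from the original post).
// Runs the store-buffering pattern `iterations` times and returns how often
// both threads read 0, i.e. how often each load appeared to execute before
// the other thread's store became visible (a StoreLoad reordering).
// store_order must be valid for a store (relaxed, release, seq_cst);
// load_order must be valid for a load (relaxed, acquire, seq_cst).
int count_both_zero(std::memory_order store_order,
                    std::memory_order load_order,
                    int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, store_order);  // STORE
            r1 = y.load(load_order);  // LOAD from a different location
        });
        std::thread t2([&] {
            y.store(1, store_order);  // STORE
            r2 = x.load(load_order);  // LOAD from a different location
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Calling count_both_zero(std::memory_order_seq_cst, std::memory_order_seq_cst, n) must return 0, because all four operations take part in the single total order S. Calling it with std::memory_order_release for the store may return a non-zero count, since the both-zero outcome is then allowed.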

Is it true that memory_order_seq_cst for an atomic load or store:

  • specifies acquire-release semantics, i.e. prevents LoadLoad, LoadStore and StoreStore reordering around an atomic operation for regular, non-atomic memory accesses,

  • but prevents StoreLoad reordering around an atomic operation only for other atomic operations with the same memory_order_seq_cst?


Solution

  • No, standard C++11 doesn't guarantee that memory_order_seq_cst prevents StoreLoad reordering of non-atomic accesses around an atomic(seq_cst) operation.

    Standard C++11 doesn't even guarantee that memory_order_seq_cst prevents StoreLoad reordering of atomic(non-seq_cst) operations around an atomic(seq_cst) operation.

    Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

    • There shall be a single total order S on all memory_order_seq_cst operations - C++11 Standard:

    § 29.3

    3

    There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens before” order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values: ...

    • But atomic operations with an ordering weaker than memory_order_seq_cst are not sequentially consistent and are not part of the single total order, i.e. non-memory_order_seq_cst operations can be reordered with memory_order_seq_cst operations in the allowed directions - C++11 Standard:

    § 29.3

    8 [ Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order_seq_cst operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. — end note ]
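    The note speaks about the general case. For the specific store-buffering pattern, one portable way to keep a weaker store ordered before a later load is a memory_order_seq_cst fence between them in both threads. The sketch below assumes that pattern only; the name count_both_zero_with_fences is hypothetical. The standard's rules for seq_cst fences place the two fences in the total order S, so at least one of the loads must observe the other thread's store.

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch: store-buffering pattern with relaxed operations and a
// seq_cst fence between the store and the load in each thread.
// Returns how often both threads read 0; the fence rules forbid that outcome.
int count_both_zero_with_fences(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread t1([&] {
            x.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_seq_cst);  // StoreLoad barrier
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

    This works only because both threads contain a fence; a fence in just one thread would not exclude the both-zero outcome.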


    Compilers also allow such reorderings:

    1. On x86_64

    Usually, if the compiler implements seq_cst as a barrier after the store, then:

    STORE-C(relaxed); LOAD-B(seq_cst); can be reordered to LOAD-B(seq_cst); STORE-C(relaxed);

    Screenshot of Asm generated by GCC 7.0 x86_64: https://godbolt.org/g/4yyeby

    It is also theoretically possible, if the compiler implements seq_cst as a barrier before the load, that:

    STORE-A(seq_cst); LOAD-C(acquire); can be reordered to LOAD-C(acquire); STORE-A(seq_cst);

    2. On PowerPC

    STORE-A(seq_cst); LOAD-C(relaxed); can be reordered to LOAD-C(relaxed); STORE-A(seq_cst);

    The following reordering is also possible on PowerPC:

    STORE-A(seq_cst); STORE-C(relaxed); can be reordered to STORE-C(relaxed); STORE-A(seq_cst);

    If even atomic operations are allowed to be reordered across an atomic(seq_cst) operation, then non-atomic accesses can also be reordered across it.

    Screenshot of Asm generated by GCC 4.8 PowerPC: https://godbolt.org/g/BTQBr8


    More details:

    1. On x86_64

    STORE-C(release); LOAD-B(seq_cst); can be reordered to LOAD-B(seq_cst); STORE-C(release);

    Intel® 64 and IA-32 Architectures

    8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

    I.e. x86_64 code:

    STORE-A(seq_cst);
    STORE-C(release); 
    LOAD-B(seq_cst);
    

    Can be reordered to:

    STORE-A(seq_cst);
    LOAD-B(seq_cst);
    STORE-C(release); 
    

    This can happen because there is no mfence between c.store and b.load:

    x86_64 - GCC 7.0: https://godbolt.org/g/dRGTaO

    C++ & asm - code:

    #include <atomic>
    
    // Atomic load-store
    void test() {
        std::atomic<int> a, b, c;
        a.store(2, std::memory_order_seq_cst);          // movl 2,[a]; mfence;
        c.store(4, std::memory_order_release);          // movl 4,[c];
        int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
    }
    

    It can be reordered to:

    #include <atomic>
    
    // Atomic load-store
    void test() {
        std::atomic<int> a, b, c;
        a.store(2, std::memory_order_seq_cst);          // movl 2,[a]; mfence;
        int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
        c.store(4, std::memory_order_release);          // movl 4,[c];
    }
    

    Also, Sequential Consistency in x86/x86_64 can be implemented in four ways: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

    1. LOAD (without fence) and STORE + MFENCE
    2. LOAD (without fence) and LOCK XCHG
    3. MFENCE + LOAD and STORE (without fence)
    4. LOCK XADD ( 0 ) and STORE (without fence)

    • Ways 1 and 2, LOAD and (STORE+MFENCE)/(LOCK XCHG), were reviewed above.
    • Ways 3 and 4, (MFENCE+LOAD)/(LOCK XADD) and STORE, allow the following reordering:

    STORE-A(seq_cst); LOAD-C(acquire); can be reordered to LOAD-C(acquire); STORE-A(seq_cst);


    2. On PowerPC

    STORE-A(seq_cst); LOAD-C(relaxed); can be reordered to LOAD-C(relaxed); STORE-A(seq_cst);

    PowerPC allows Store-Load reordering (Table 5 - PowerPC): http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

    Stores Reordered After Loads

    I.e. PowerPC code:

    STORE-A(seq_cst);
    STORE-C(relaxed); 
    LOAD-C(relaxed); 
    LOAD-B(seq_cst);
    

    Can be reordered to:

    LOAD-C(relaxed);
    STORE-A(seq_cst);
    STORE-C(relaxed); 
    LOAD-B(seq_cst);
    

    PowerPC - GCC 4.8 : https://godbolt.org/g/xowFD3

    C++ & asm - code:

    #include <atomic>
    
    // Atomic load-store
    void test() {
        std::atomic<int> a, b, c;       // addr: 20, 24, 28
        a.store(2, std::memory_order_seq_cst);          // li r9<-2; sync; stw r9->[a];
        c.store(4, std::memory_order_relaxed);          // li r9<-4; stw r9->[c];
        c.load(std::memory_order_relaxed);              // lwz r9<-[c];
        int tmp = b.load(std::memory_order_seq_cst);    // sync; lwz r9<-[b]; ... isync;
    }
    

    If a.store is divided into two parts, it can be reordered to:

    #include <atomic>
    
    // Atomic load-store
    void test() {
        std::atomic<int> a, b, c;       // addr: 20, 24, 28
        //a.store(2, std::memory_order_seq_cst);            // part-1: li r9<-2; sync;
        c.load(std::memory_order_relaxed);              // lwz r9<-[c];
        a.store(2, std::memory_order_seq_cst);          // part-2: stw r9->[a];
        c.store(4, std::memory_order_relaxed);          // li r9<-4; stw r9->[c];
        int tmp = b.load(std::memory_order_seq_cst);    // sync; lwz r9<-[b]; ... isync;
    }
    

    Here the load-from-memory lwz r9<-[c]; is executed earlier than the store-to-memory stw r9->[a];.


    The following reordering is also possible on PowerPC:

    STORE-A(seq_cst); STORE-C(relaxed); can be reordered to STORE-C(relaxed); STORE-A(seq_cst);

    Because PowerPC has a weak memory ordering model, it allows Store-Store reordering (Table 5 - PowerPC): http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

    Stores Reordered After Stores

    I.e. on PowerPC a store can be reordered with another store, so the previous example can be reordered as follows:

    #include <atomic>
    
    // Atomic load-store
    void test() {
        std::atomic<int> a, b, c;       // addr: 20, 24, 28
        //a.store(2, std::memory_order_seq_cst);            // part-1: li r9<-2; sync;
        c.load(std::memory_order_relaxed);              // lwz r9<-[c];
        c.store(4, std::memory_order_relaxed);          // li r9<-4; stw r9->[c];
        a.store(2, std::memory_order_seq_cst);          // part-2: stw r9->[a];
        int tmp = b.load(std::memory_order_seq_cst);    // sync; lwz r9<-[b]; ... isync;
    }
    

    Here the store-to-memory stw r9->[c]; is executed earlier than the store-to-memory stw r9->[a];.
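    The Store-Store case can be probed from a second thread. The following is a minimal sketch with hypothetical names (count_stale_a, c_store_order): a reader checks whether it can see the new value of c while still seeing the old value of a. With a relaxed store to c that outcome is allowed by the standard, although it may never show up on strongly ordered hardware such as x86. With a release store to c, paired with the acquire load, it is forbidden.

```cpp
#include <atomic>
#include <thread>

// Hypothetical harness (illustration only).
// The writer does STORE-A(seq_cst); STORE-C(c_store_order).
// Returns how often the reader saw c == 4 but a != 2, i.e. how often the
// two stores appeared in the opposite order.
// c_store_order must be valid for a store (relaxed, release, seq_cst).
int count_stale_a(std::memory_order c_store_order, int iterations) {
    int stale = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> a{0}, c{0};
        int seen_c = 0, seen_a = 0;

        std::thread writer([&] {
            a.store(2, std::memory_order_seq_cst);  // STORE-A(seq_cst)
            c.store(4, c_store_order);              // STORE-C(relaxed or release)
        });
        std::thread reader([&] {
            seen_c = c.load(std::memory_order_acquire);
            seen_a = a.load(std::memory_order_relaxed);
        });
        writer.join();
        reader.join();

        if (seen_c == 4 && seen_a != 2) {
            ++stale;
        }
    }
    return stale;
}
```

    count_stale_a(std::memory_order_release, n) must return 0, because the release store to c synchronizes with the acquire load that reads 4. count_stale_a(std::memory_order_relaxed, n) has no such guarantee.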