How many memory barriers do we need to implement a Peterson lock?

I'm trying to figure out how many memory fences do we need to implement a Peterson lock. Clearly, we need at least one.

https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

In practice, it seems that one is sufficient, based on a number of tests executed in different architectures. However, in theory, do we need additional ones?

I have tried the code below

my peterson_lock failed in this situation

changing the order between Mark A by Mark B and it works! However, the memory fence does not capture the ordering between Mark A and Mark B. So, does it mean that the program is still incorrect?

#include <pthread.h>

typedef struct {
    volatile bool flag[2];
    volatile int victim;
} peterson_lock_t;

void peterson_lock_init(peterson_lock_t &lock) {
    lock.flag[0] = lock.flag[1] = false;
    lock.victim = 0;
}

void peterson_lock(peterson_lock_t &lock, int id) {
    lock.victim = id; // Mark as A
    lock.flag[id] = true; // Mark as B
    asm volatile ("mfence" : : : "memory");
    while (lock.flag[1 - id] && lock.victim == id);
}

void peterson_unlock(peterson_lock_t &lock, int id) {
    lock.flag[id] = false;
    lock.victim = id;
}

After replacing the order of lines "Mark as A" and "Mark as B" I expected the program to run almost always correctly, as it is now in agreement with the Wikipedia entry on Peterson lock.

https://en.wikipedia.org/wiki/Peterson%27s_algorithm

However, the memory fence does not protect the ordering between Mark A and Mark B. Therefore, is it still possible that the program is incorrect? If so, how to fix it?

Solution

Nobody uses a Peterson lock on mainstream platforms because mutexes are available. But assuming you cannot use those and you are writing code for an old X86 platform without access to modern primitives (no memory model, no mutexes, no atomic RMW operations), this algorithm might be considered.

Your implementation of the Peterson lock is incorrect (also after swapping the lines 'Mark as A' & 'Mark as B').
If you translate the Wikipedia pseudo code to C++, the correct implementation becomes:

typedef struct {
    volatile bool flag[2];
    volatile int victim;
} peterson_lock_t;

void peterson_lock(peterson_lock_t &lock, int id) {
    lock.flag[id] = true;
    lock.victim = 1-id;
    asm volatile ("mfence" ::: "memory"); // CPU #StoreLoad barrier
    while (lock.flag[1-id] && lock.victim == 1-id);
}

void peterson_unlock(peterson_lock_t &lock, int id) {
    asm volatile("" ::: "memory"); // compiler barrier
    lock.flag[id] = false;
}

In addition to the use of volatile on he lock variables, the mfence instruction (in peterson_lock) is necessary to prevent #StoreLoad reordering. This shows a rare case where an algorithm requires sequential consistency; i.e. operations on the lock variables must take place in a single total order.

The use of volatile is based on non-portable (but 'almost' correct) properties on gcc/X86. "'almost' correct" because even though a volatile store on X86 is a release operation on CPU level, the compiler can still reorder operations on volatile and non-volatile data.
For that reason, I added a compiler barrier before resetting lock.flag[id] in peterson_unlock.

But it is probably a good idea to use volatile on all data that is shared between threads using this algorithm, because the compiler can still perform store and load operations on non-volatile data in a CPU register only.

Note that with the use of volatile on shared data, the compiler barrier in peterson_unlock becomes redundant.