I'm trying to figure out how many memory fences do we need to implement a Peterson lock. Clearly, we need at least one.
https://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/
In practice, it seems that one is sufficient, based on a number of tests executed in different architectures. However, in theory, do we need additional ones?
I have tried the code below
my peterson_lock failed in this situation
changing the order between Mark A by Mark B and it works! However, the memory fence does not capture the ordering between Mark A and Mark B. So, does it mean that the program is still incorrect?
#include <pthread.h>
typedef struct {
volatile bool flag[2];
volatile int victim;
} peterson_lock_t;
void peterson_lock_init(peterson_lock_t &lock) {
lock.flag[0] = lock.flag[1] = false;
lock.victim = 0;
}
void peterson_lock(peterson_lock_t &lock, int id) {
lock.victim = id; // Mark as A
lock.flag[id] = true; // Mark as B
asm volatile ("mfence" : : : "memory");
while (lock.flag[1 - id] && lock.victim == id);
}
void peterson_unlock(peterson_lock_t &lock, int id) {
lock.flag[id] = false;
lock.victim = id;
}
After replacing the order of lines "Mark as A" and "Mark as B" I expected the program to run almost always correctly, as it is now in agreement with the Wikipedia entry on Peterson lock.
https://en.wikipedia.org/wiki/Peterson%27s_algorithm
However, the memory fence does not protect the ordering between Mark A and Mark B. Therefore, is it still possible that the program is incorrect? If so, how to fix it?
Nobody uses a Peterson lock on mainstream platforms because mutexes are available.
But assuming you cannot use those and you are writing code for an old X86
platform without access to modern primitives (no memory model, no mutexes, no atomic RMW operations), this algorithm might be considered.
Your implementation of the Peterson lock is incorrect (also after swapping the lines 'Mark as A' & 'Mark as B').
If you translate the Wikipedia pseudo code to C++
, the correct implementation becomes:
typedef struct {
volatile bool flag[2];
volatile int victim;
} peterson_lock_t;
void peterson_lock(peterson_lock_t &lock, int id) {
lock.flag[id] = true;
lock.victim = 1-id;
asm volatile ("mfence" ::: "memory"); // CPU #StoreLoad barrier
while (lock.flag[1-id] && lock.victim == 1-id);
}
void peterson_unlock(peterson_lock_t &lock, int id) {
asm volatile("" ::: "memory"); // compiler barrier
lock.flag[id] = false;
}
In addition to the use of volatile
on he lock
variables, the mfence
instruction (in peterson_lock
) is necessary to prevent #StoreLoad reordering.
This shows a rare case where an algorithm requires sequential consistency; i.e. operations on the lock
variables must take place in a single total order.
The use of volatile
is based on non-portable (but 'almost' correct) properties on gcc/X86
.
"'almost' correct" because even though a volatile
store on X86
is a release operation on CPU level, the compiler can still reorder operations on volatile
and non-volatile
data.
For that reason, I added a compiler barrier before resetting lock.flag[id]
in peterson_unlock
.
But it is probably a good idea to use volatile
on all data that is shared between threads using this algorithm,
because the compiler can still perform store and load operations on non-volatile
data in a CPU register only.
Note that with the use of volatile
on shared data, the compiler barrier in peterson_unlock
becomes redundant.