
C++20 mutex with atomic wait


Since C++20, std::atomic has wait and notify operations. With is_always_lock_free we can check that the implementation is lock-free. With these bricks, building a lock-free mutex is not so difficult. In trivial cases, locking would be a compare-exchange operation, or a wait if the mutex is locked. The big question is whether it is worth it or not. If I can create such an implementation, most probably the STL version is much better and faster. However, I still remember how surprised I was when I saw how QMutex outperformed std::mutex (QMutex vs std::mutex in 2016). So what do you think: should I experiment with such an implementation, or is the current implementation of std::mutex mature enough to be optimized far beyond these tricks?

UPDATE: My wording wasn't the best; I meant that the implementation could be lock-free on the happy path (locking from the not-locked state). Of course, we should be blocked and re-scheduled if we need to wait to acquire the lock. Most probably atomic::wait is not implemented by a simple spinlock on most platforms (let's ignore the corner cases for now), so basically it achieves the very same thing mutex::lock does. So if I implement such a class, it would do exactly what std::mutex does (again, on most of the popular platforms). That means the STL could use the same tricks in their mutex implementations on the platforms that support them. Something like this spinlock, but using atomic wait instead of spinning; roughly the sketch below. Should I trust my STL implementation to have done so?
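
Roughly what I have in mind, as a minimal untested sketch (the class name is just a placeholder):

    #include <atomic>

    class atomic_wait_mutex {           // placeholder name
        std::atomic<unsigned> state{0}; // 0 = unlocked, 1 = locked
    public:
        void lock() {
            unsigned expected = 0;
            // Happy path: one compare-exchange from unlocked to locked.
            while (!state.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed)) {
                if (expected == 1)
                    state.wait(1);      // block until unlock() changes the value
                expected = 0;           // retry the CAS from the unlocked state
            }
        }
        void unlock() {
            state.store(0, std::memory_order_release);
            state.notify_one();         // unconditional, even if nobody is waiting
        }
    };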


Solution

  • A lock-free mutex is a contradiction.

    You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.

    But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.

    I meant that the implementation could be lock-free on the happy path (locking from the not-locked state)

    std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so would an implementation built on std::atomic::wait / notify.

    Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)
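
    For contrast, here's a minimal sketch of what "lock-free" actually guarantees: in a CAS retry loop like this, a failed attempt means some other thread's attempt succeeded, so suspending any one thread mid-operation can't stop the rest of the system from making progress.

    #include <atomic>

    std::atomic<unsigned> counter{0};

    // Lock-free: no thread ever blocks here.  If the compare-exchange fails,
    // it's only because another thread's succeeded, so the system as a whole
    // always makes progress, even if any one thread sleeps mid-loop.
    void lock_free_increment() {
        unsigned old = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed)) {
            // 'old' was updated with the current value; just retry.
        }
    }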

    There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in the choice of what to do in the already-locked case. (But it might be tricky to design a way to avoid calling notify when there are no waiters, something that actual std::mutex manages in the glibc / pthreads implementation on Linux; see the sketch below.)

    (I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)
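
    Here's my own illustration of how that waiter-detection could work with C++20 atomics, along the lines of the classic 3-state futex mutex from Ulrich Drepper's "Futexes Are Tricky" paper (an untested sketch with a placeholder name, not what any particular implementation does): a third state records that someone may be sleeping, so unlock only makes the wake syscall when needed.

    #include <atomic>

    // 0 = unlocked, 1 = locked with no waiters, 2 = locked, maybe waiters.
    class futex_style_mutex {           // placeholder name
        std::atomic<unsigned> state{0};
    public:
        void lock() {
            unsigned c = 0;
            if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
                return;                 // fast path: 0 -> 1, no system call
            if (c != 2)                 // mark the lock as contended
                c = state.exchange(2, std::memory_order_acquire);
            while (c != 0) {            // until we took it from the unlocked state
                state.wait(2);          // returns immediately if state != 2
                c = state.exchange(2, std::memory_order_acquire);
            }                           // we hold the lock, conservatively in state 2
        }
        void unlock() {
            if (state.exchange(0, std::memory_order_release) == 2)
                state.notify_one();     // wake syscall only if there may be waiters
        }
    };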


    A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.

    If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative. Although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps if you could figure out that it was deciding to sleep too early or too late for your access pattern.

    And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall-back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
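
    As a sketch of that spin-wait portion (my illustration, untested; the helper name and the spin count are arbitrary tuning placeholders), assuming the same 0 = unlocked / 1 = locked encoding as above:

    #include <atomic>
    #include <immintrin.h>               // _mm_pause(): x86-specific

    std::atomic<unsigned> state{0};      // 0 = unlocked, 1 = locked

    bool try_lock_spinning(int max_spins) {  // hypothetical helper
        while (max_spins-- > 0) {
            // Spin read-only ("test and test-and-set") so failed attempts
            // don't spam atomic RMWs that bounce the cache line around.
            if (state.load(std::memory_order_relaxed) == 0) {
                unsigned expected = 0;
                if (state.compare_exchange_weak(expected, 1,
                                                std::memory_order_acquire))
                    return true;         // got the lock while spinning
            }
            _mm_pause();                 // like Rust's std::hint::spin_loop() on x86
        }
        return false;  // caller falls back to state.wait() / OS-assisted sleep
    }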

    See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.


    Linux / libstdc++ behaviour of notify/wait

    I tested what system calls std::atomic::wait makes when it has to wait for a long time, on Arch Linux (glibc 2.33).

    • std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably, it's able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which would otherwise cost maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this core's L1d cache).

    • std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.

    • std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)

    • std::atomic::notify_one: always a futex(address, FUTEX_WAKE, 1) system call, even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.

    • std::atomic::wait: spins a few times in user-space, including 4x sched_yield() calls, before a futex(addr, FUTEX_WAIT, old_val, NULL).

    Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.

    I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?

    Test code:

    #include <atomic>
    #include <thread>
    #include <unistd.h>   // for quick & dirty sleep and usleep.  TODO: port to non-POSIX.
    #include <mutex>
    
    #if 1        // test std::atomic
    //std::atomic_unsigned_lock_free gvar;
    //std::atomic_uint_fast_wait_t gvar;
    std::atomic<unsigned> gvar;
    
    void waiter(){
        volatile unsigned sink;
        while (1) {
            sink = gvar;
            gvar.wait(sink);   // block until it's not the old value.
            // on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL )  or whatever the current counter value is
        }
    }
    
    void notifier(){
        while(1){
            sleep(1);
            gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
            gvar.notify_one();
        }
    }
    #else
    std::mutex mut;
    
    void waiter(){
        unsigned sink = 0;
        while (1) {
            mut.lock();     // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
            mut.unlock();
            sink++;
            usleep(sink);  // give the other thread plenty of time to take the lock, so we don't get it twice.
        }
    }
    
    void notifier(){
        while(1){
            mut.lock();
            sleep(1);
            mut.unlock();
        }
    }
    #endif
    
    int main(){
        std::thread t (waiter);   // comment this to test the no-contention case
        notifier();
        // loops forever: kill it with control-C
    }
    

    Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)

    If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)


    Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.

    If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and atomic wait as Linux/glibc. Unless std::atomic::wait's retry strategy on the only system you care about happens to be perfect for your use-case.

    It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.