Tags: c, gcc, memory-barriers, spinlock

Memory order for a ticket-taking spin-lock mutex


Suppose I have the following ticket-taking spinlock mutex implementation (in C using GCC atomic builtins). As I understand it, the use of the "release" memory order in the unlock function is correct. I'm unsure, though, about the lock function. Because this is a ticket-taking mutex, there's a field indicating the next ticket number to be handed out, and a field to indicate which ticket number currently holds the lock. I've used acquire-release on the ticket increment and acquire on the spin load. Is that unnecessarily strong, and if so, why?

Separately, should those two fields (ticket and serving) be spaced so that they're on different cache lines, or does that not matter? I'm mainly interested in arm64 and amd64.

#include <stdint.h>

typedef uint64_t u64;   /* 64-bit unsigned shorthand */

typedef struct {
        u64 ticket;
        u64 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
        u64 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_ACQ_REL);
        while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE));
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
        (void) __atomic_fetch_add(&m->serving, 1, __ATOMIC_RELEASE);
}

UPDATE: based on the advice in the accepted answer, I've adjusted the implementation to the following. This mutex is intended for the low-contention case.

typedef uint32_t u32;   /* 32-bit unsigned shorthand */

typedef struct {
        u32 ticket;
        u32 serving;
} ticket_mutex;

void
ticket_mutex_lock(ticket_mutex *m)
{
        u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
        while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
                #ifdef __x86_64__
                __asm __volatile ("pause");
                #endif
        }
}

void
ticket_mutex_unlock(ticket_mutex *m)
{
        u32 my_ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);
        (void) __atomic_store_n(&m->serving, my_ticket+1, __ATOMIC_RELEASE);
}

Solution

  • m->ticket increment only needs to be RELAXED. You only need each thread to get a different ticket number; it can happen as early or late as you want wrt. other operations in the same thread.

    load(&m->serving, acquire) is the operation that orders the critical section, preventing its operations from starting until we've synchronized-with a RELEASE operation in the unlock function of the previous holder of the lock. So the m->serving load needs to be at least acquire.

    Even if the m->ticket++ doesn't complete until after an acquire load of m->serving, that's fine. The while condition still determines whether execution proceeds (non-speculatively) into the critical section. Speculative execution into the critical section is fine, and good since it probably means it's ready to commit sooner, reducing the time the lock is held.

    Extra ordering on the RMW operation won't make it any faster locally or in terms of inter-thread visibility, and on ISAs where acquire/release RMWs cost more (e.g. AArch64, where they use acquire/release load/store-exclusive instructions) it would slow down the thread taking the lock.
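
    For reference, here is the same pair of functions written against C11 <stdatomic.h> instead of the GCC builtins, with the orderings discussed above (relaxed RMW on ticket, acquire spin load, release store in unlock). This is just a comparison sketch; the c11_* names are mine, not part of the question's code.

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
            _Atomic uint32_t ticket;
            _Atomic uint32_t serving;
    } c11_ticket_mutex;

    void c11_ticket_lock(c11_ticket_mutex *m)
    {
            /* relaxed: we only need each thread to get a unique ticket number */
            uint32_t my_ticket = atomic_fetch_add_explicit(&m->ticket, 1,
                                                           memory_order_relaxed);
            /* acquire: synchronizes-with the release store in unlock */
            while (atomic_load_explicit(&m->serving, memory_order_acquire) != my_ticket)
                    ;
    }

    void c11_ticket_unlock(c11_ticket_mutex *m)
    {
            uint32_t s = atomic_load_explicit(&m->serving, memory_order_relaxed);
            atomic_store_explicit(&m->serving, s + 1, memory_order_release);
    }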


    One cache line or two

    For performance, I think with high contention, there are advantages to keeping the members in separate cache lines.

    Threads needing exclusive ownership of the cache line to get a ticket number won't contend with the thread unlocking .serving, so those inter-thread latency delays can happen in parallel.
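
    A sketch of what that separation could look like; the 64-byte line size is an assumption (typical for x86-64 and many AArch64 cores), and the padded type name is mine:

    #include <stdint.h>

    /* Hypothetical high-contention layout: each field starts its own cache line,
     * so ticket-taking RMWs and the unlocker's store to serving contend less. */
    typedef struct {
            _Alignas(64) uint32_t ticket;    /* written by every thread taking a ticket */
            _Alignas(64) uint32_t serving;   /* written only by the unlocking thread */
    } ticket_mutex_padded;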

    With multiple cores in the spin-wait while(load(serving)) loop, they can hit in their local L1d cache until something invalidates shared copies of the line, not creating any extra traffic. But they waste a lot of power unless you use something like x86 _mm_pause(), as well as wasting execution resources that could be shared with another logical core on the same physical core. x86 pause also avoids a branch mispredict when leaving the spin loop.

    Exponential backoff up to some maximum number of pauses between checks is a common recommendation, but here we can do better: use a number of pause instructions between checks that scales with my_ticket - m->serving, so you check more often as your ticket gets close to being served (see the sketch below).

    In really high contention cases, falling back to OS-assisted sleep/wake, like Linux futex, is appropriate if you'll be waiting for a long time. Or, since we can see how close to the head of the queue we are, yield, nanosleep, or futex-wait if your wait interval will be more than 3 or 8 ticket numbers or whatever. (Tunable depending on how long it takes to serve a ticket.)
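
    Putting those two ideas together, here is a sketch of a lock path with distance-proportional pausing and an OS-yield fallback, using the updated ticket_mutex from the question. The thresholds, the pause multiplier, and the function name are made-up tuning knobs for illustration, not measured values.

    #include <sched.h>      /* sched_yield */
    #include <stdint.h>

    void ticket_mutex_lock_backoff(ticket_mutex *m)   /* hypothetical variant */
    {
            uint32_t my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
            for (;;) {
                    uint32_t serving = __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE);
                    uint32_t ahead = my_ticket - serving;   /* wrap-safe unsigned math */
                    if (ahead == 0)
                            return;                         /* our turn: lock acquired */
                    if (ahead > 8) {
                            /* far from the head of the queue: let the OS run something
                             * else; a futex wait or nanosleep could go here instead */
                            sched_yield();
                            continue;
                    }
                    /* near the head: spin, pausing roughly in proportion to distance */
                    for (uint32_t i = 0; i < ahead * 16; i++) {
                            #ifdef __x86_64__
                            __asm __volatile ("pause");
                            #endif
                    }
            }
    }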

    (Using futex, you might introduce a read of m->ticket into the unlock to figure out if there might be any threads sleeping, waiting for a notify, like C++20 atomic<>.wait() and atomic.notify_all(). Unfortunately I don't know a good way to figure out which thread to notify, instead of waking them all up to check if they're the lucky winner.)
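
    For reference, a minimal sketch of Linux futex wait/wake helpers that such a fallback could build on (raw syscall, since glibc doesn't wrap futex directly). The helper names are mine, and the wake side wakes everyone precisely because we don't know which waiter holds the next ticket.

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <limits.h>
    #include <stdint.h>

    /* Sleep until woken (possibly spuriously); the kernel only actually sleeps
     * if *addr still equals 'expected' at the time of the call. */
    static void serving_wait(uint32_t *addr, uint32_t expected)
    {
            syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
    }

    /* Wake every waiter; each one re-checks whether its ticket is now being served. */
    static void serving_wake_all(uint32_t *addr)
    {
            syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
    }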


    With low average contention, you should keep both in the same cache line. An access to .ticket is always immediately followed by a load of .serving. In the unlocked no-contention case, this means only one cache line is bouncing around, or having to stay hot for the same core to take/release the lock.

    If the lock is already held, the thread wanting to unlock needs exclusive ownership of the cache line to RMW or store. It loses this whether another core does an RMW or just a pure load on the line containing .serving.

    Under low contention there won't be many cases where multiple waiters are all spinning on the same lock, and where new threads getting a ticket number would delay the unlock and its visibility to the thread waiting for it.

    This is my intuition, anyway; it's probably hard to microbenchmark, unless a cache-miss atomic RMW stops the later load from even starting to request the other line, in which case you could have two cache-miss latencies in taking the lock.


    Avoiding an atomic RMW in the unlock?

    The thread holding the lock knows it has exclusive ownership; no other thread will be modifying m->serving concurrently. If you had the lock owner remember its own ticket number, you could optimize the unlock to just a store.

    void ticket_mutex_unlock(ticket_mutex *m, uint32_t ticket_num)
    {
            (void) __atomic_store_n(&m->serving, ticket_num+1, __ATOMIC_RELEASE);
    }
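
    The matching lock-side change would look something like this: it's just the question's updated lock with the return type changed so the caller can hand the ticket back to unlock (a sketch, since the question's API returns void).

    u32
    ticket_mutex_lock(ticket_mutex *m)
    {
            u32 my_ticket = __atomic_fetch_add(&m->ticket, 1, __ATOMIC_RELAXED);
            while (my_ticket != __atomic_load_n(&m->serving, __ATOMIC_ACQUIRE)) {
                    #ifdef __x86_64__
                    __asm __volatile ("pause");
                    #endif
            }
            return my_ticket;   /* pass back to ticket_mutex_unlock(m, my_ticket) */
    }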
    

    Or without that API change (i.e. without making the lock function return the ticket from a u32 ticket_mutex_lock(), as sketched above):

    void ticket_mutex_unlock(ticket_mutex *m)
    {
            uint32_t ticket = __atomic_load_n(&m->serving, __ATOMIC_RELAXED);  // we already own the lock
            // and no other thread can be writing concurrently, so we don't need an atomic RMW:
            // a separate relaxed load and release store of the incremented value is enough
            (void) __atomic_store_n(&m->serving, ticket+1, __ATOMIC_RELEASE);
    }
    

    This has a nice efficiency advantage on ISAs that need LL/SC retry loops for atomic RMWs, where spurious failure from another core reading the value can happen. And on x86 where the only possible atomic RMW is a full barrier, stronger even than needed for C seq_cst semantics.
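
    To illustrate the retry-loop point, here's a conceptual sketch (mine, not code from the answer) of why an atomic RMW can cost more than a plain store on such ISAs: a weak compare-exchange retry loop, which is roughly the shape of the hardware LL/SC sequence when interference from other cores forces retries. The plain release store in the unlock above never has to retry.

    #include <stdint.h>

    static uint32_t fetch_add_release_by_cas(uint32_t *p, uint32_t v)
    {
            uint32_t old = __atomic_load_n(p, __ATOMIC_RELAXED);
            while (!__atomic_compare_exchange_n(p, &old, old + v, 1 /* weak */,
                                                __ATOMIC_RELEASE, __ATOMIC_RELAXED))
                    ;   /* on failure, 'old' is reloaded with the current value */
            return old;
    }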

    BTW, the lock fields would be fine as uint32_t; you're not going to have 2^32 threads waiting for a lock, so I used uint32_t instead of u64. Unsigned wrap-around is well-defined, and subtraction like ticket - serving Just Works even across the wrapping boundary: in 32-bit unsigned arithmetic, 1 - 0xffffffff gives 2. So you can still calculate how close you are to being served, for sleep decisions.
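
    As a concrete illustration of that arithmetic (the helper name is mine):

    #include <stdint.h>

    /* How many tickets are still ahead of ours?  Unsigned subtraction is modulo
     * 2^32, so the answer is right even after the counters wrap past zero. */
    static uint32_t tickets_ahead(uint32_t my_ticket, uint32_t serving)
    {
            return my_ticket - serving;   /* e.g. my_ticket=1, serving=0xffffffff -> 2 */
    }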

    Using 32-bit counters is not a big deal on x86-64, only saving a bit of code size, and probably not a factor at all on AArch64, but it will help significantly on some 32-bit ISAs.