How to make windows slim read writer lock fair?

i found out that windows implemented a slim reader-writer-lock (see https://msdn.microsoft.com/en-us/library/windows/desktop/aa904937%28v=vs.85%29.aspx ). Unfortunately (for me) this rw-lock is neither fifo nor is it fair (in any sense). Is there a possibility to make the windows rw-lock with some workaround fair or fifo? If not, in which scenarios would you use the windows slim rw-lock?

Solution

It is unlikely you can change the slim lock itself to be fair, especially since the documentation doesn't indicate any method of doing so, and most locks today are unfair for performance reasons.

That said, it is fairly straightforward to roll your own approximately FIFO lock with Windows events, and a 64-bit control word that you manipulate with compare and swap that is still very slim. Here's an outline:

The state of the lock is reflected in the control word is manipulated atomically to transition between the states, and allows threads to enter the lock (if allowed) with a single atomic operation and no kernel switch (that's the performance part of "slim"). The reset events are used to notify waiting threads, when threads need to block and can be allocated on-demand (that's the low memory footprint of slim).

The lock control word has the follow states:

Free - no readers or writers, and no waiters. Any thread can acquire the lock for reading or writing by atomically transitioning the lock into state (2) or (3).
N readers in the lock. There are N readers in the lock at the moment. New readers can immediately acquire the lock by adding 1 to the count - use a field of 30-bits or so within the control word to represent this count. Writers must block (perhaps after spinning). When readers leave the lock, they decrement the count, which may transition to state (1) when the last reader leaves (although they don't need to do anything special in a (2) -> (1) transition).
State (2) + waiting writers + 0 or more waiting readers. In this state, there are 1 or more readers still in the lock, but at least one waiting writer. The writers should wait on a manual-reset event, which is designed, although not guaranteed, to be FIFO. There is a field in the control word to indicate how many writers are waiting. In this state, new readers that want to enter the lock cannot, and set a reader-waiting bit instead, and block on the reader-waiting event. New writers increment the waiting writer count and block on the writer-waiting event. When the last reader leaves (setting the reader-count field to 0), it signals the writer-waiting event, releasing the longest-waiting writer to enter the lock.
Writer in the lock. When a writer is in the lock, all readers queue up and wait on the reader-waiting event. All incoming writers increment the waiting-writer count and queue up as usual on the writer-waiting event. There may even be some waiting readers when the writer acquires the lock because of state (3) above, and these are treated identically. When the writer leaves the lock, it checks for waiting writers and readers and either unblocks a writer or all readers, depending on policy, discussed below.

All the state transitions discussed above are done atomically using compare-and-swap. The typical pattern is that any of the lock() or unlock() calls look at the control word, determine what state they are in and what transition needs to happen (following the rules above), calculate the new control word in a temporary then attempt to swap in the new control word with compare-and-swap. Sometimes that attempt fails because another thread concurrently modified the control word (e.g., another reader entered the lock, incrementing the reader count). No problem, just start over from "determine state..." and try again. Such races are rare in practice since the state word calculation is very short, and that's just how things work with CAS-based complex locks.

This lock design is "slim" is almost every sense. Performance-wise, it is near the top of what you can get for a general purpose design. In particular, the common fast-paths of (a) reader entering the lock with 0 or more readers already in the block (b) reader leaving the lock with 0 or more readers still in the lock and (c) writer entering/leaving an uncontended lock are all about as fast as possible in the usual case: a single atomic operation. Furthermore, the reader entry and exit paths are "lock free" in the sense that incoming readers do not temporarily take an mutex internal to the rwlock, manipulate state, and then unlock it while entering/leaving the lock. This approach is slow and subject to issues whenever a reader thread performs a context switch at the critical moment in holds the internal lock. Such approaches do not scale to heaver reader activity with a short rwlock critical section: even though multiple readers can, in theory, enter the critical section, they all bottleneck on entering and leaving the internal lock (which happens twice for every enter/exit operation) and performance is worse than a normal mutex!

It is also lightweight in that it only needs a couple of Windows Event objects, and these objects can be allocated on demand - they are only needed when contention occurs and a state transition that requires blocking is about to occur. That's how CRITICAL_SECTION objects work.

The lock above is fair in the sense that readers won't block writers, and writers are served in FIFO order. How writers interact with waiting readers is up to your policy for who to unblock when the lock becomes free after a writer unlocks and there are both waiting readers and writers. On simple policy is to unblock all waiting readers.

In this design, writers will alternate in FIFO order with FIFO batches of readers. Writers are FIFO relative to other writers, and reader batches are FIFO relative to other reader batches, but the relationship between writers and readers isn't exactly FIFO: because all incoming readers are added to the same reader-waiting set, in the case that there are already several waiting writers, arriving readers all go into the next "batch" to be released, which actually puts them ahead of writers that are already waiting. That's quite reasonable though: readers all go at once, so adding more readers to the batch doesn't necessary cost much and probably increases efficiency, and if you did serve everything thread in strict FIFO order, the lock would reduce in behavior to a simple mutex under contention.

Another possible design is to always unblock writers if any are waiting. This favors writers at the expense of readers and does mean that a never-ending stream of writers could block out readers indefinitely. This approach makes sense where you know your writes are latency sensitive important and you either aren't worried about reader starvation, or you know it can't occur due to the design of your application (e.g., because there is only one possible writer at a time).

Beyond that, there are a lot of other policies possible, such as favoring writers up until readers have been waiting for a certain period, or limiting reader batch sizes, or whatever. They are mostly possible to implement efficiently since the bookkeeping is generally limited to the slow paths where threads will block anyway.

I've glossed over some implementation details and gotchas here (in particular, the need to be careful when making the transitions that involve blocking to avoid "missed wakeup" issues) - but this definitely works. I've written such a lock before the slim rwlock existed to fill the need for a fast high-performance rwlock, and it performs very well. Other tradeoffs are possible too, e.g., for designs in which reads are expected to dominate, contention can be reduced by splitting up the control word across cache lines, at the cost of more expensive write operations.

One final note - this lock is a bit fatter, in memory use, than the Windows one in the case that is contended - because it allocates one or two windows Events per lock, while the slim lock avoids this. The slim lock likely does it by directly supporting the slim lock behavior in the kernel, so the control word can directly be used as part of the kernel-side waitlist. You can't reproduce that exactly, but you can still remove the per-lock overhead in another way: use thread-local storage to allocate your two events per thread rather than per lock. Since a thread can only be waiting on one lock at a time, you only need this structure one per thread. That brings it into line with slim lock in memory use (unless you have very few locks and a ton of threads).