multithreading scheduling mutex atomic lock-free

When are lock free data structures less performant than mutual exclusion (mutexes)?

I read somewhere (can't find the page anymore) that lock free data structures are more efficient "for certain workloads" which seems to imply that sometimes they're actually slower or the gain from them can be zero in some situations. Taking the ~100 cycle hit of a lock instruction to do an atomic op sounds plenty faster to me than going to sleep and waiting for the scheduler to wake the process back up, so it's not obvious to me under what circumstances a lock free data structure would be less preferable than old fashioned mutexes. If the lock is available 99% of the time and the process doesn't have to go to sleep, is a mutex then faster? Is there a good rule of the thumb for knowing which way to go assuming a suitable lock free data structure is available?

Solution

A common approach to implementing a lock-free data structure is to have a mutable reference to an immutable object, and have anything that wants to change the structure grab the reference, produce a new version of the object with suitable changes applied, and then CompareExchange the reference to point to the new object. If the CompareExchange works, great. If not, ditch the new object, re-grab the reference, and start over.

This can work well if producing the new object is cheap and the level of contention is low enough that the CompareExchange will usually work. If there is considerable contention, and if producing the new object is slow, simultaneous attempted updates by N threads may take N^2 time to complete. As an extreme example, suppose 100 threads are running on a CPU, an update takes 100ms of CPU time (just over a time-slice), and all 100 threads attempt to update an object at once. During the first ten seconds, each thread will produce a new object based on the original one. One of the threads will successfully do the CompareExchange, while the others will all fail. Then during the next 9.9 seconds, 99 threads will generate new versions of the object, after which one will successfully post its update and 98 will fail. The net effect will be that the lock-free method will take 505 seconds' worth of CPU time to perform 100 updates, when a locking system could have done them in about 10 seconds.