I know what spinlocks are and that they use busy waiting. But why can they become a performance problem in multithreaded programs on a multicore processor?
And what can be done about it?
Your first issue is that when a spinlock-protected section becomes contended, it is usually in a situation where there are more threads ready for execution than you have cores available. That means each thread busy-waiting in a spinlock is potentially starving another thread that would have had something useful to do.
Then there is the cost of the spinlock itself. Spinning burns through your budget of memory transactions, and that budget is shared between processor cores. Effectively, the waiting threads can slow down the very operations inside the critical section they are waiting on.
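To make the memory-traffic point concrete, here is a minimal sketch of a naive test-and-set spinlock (the class name is mine, not from any library): every failed exchange is a read-modify-write that bounces the lock's cache line between the waiting cores, and that coherency traffic competes with the memory accesses of the thread actually working inside the critical section.

```cpp
#include <atomic>

// Minimal sketch of a naive test-and-set spinlock (illustrative, not production code).
class NaiveSpinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Each failed exchange() is an atomic read-modify-write: it keeps the
        // cache line ping-ponging between cores even while the lock is held,
        // eating into the memory-transaction budget shared by all cores.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // busy wait
        }
    }
    void unlock() {
        locked_.store(false, std::memory_order_release);
    }
};
```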
A good example of that was the memory allocator in the Windows kernel, in versions between 1703 and 1803. On systems with more than 16 threads, once total CPU utilization exceeded 50%, a spinlock on that path went out of control and would start eating up 90% of the CPU time. Time spent inside the critical section increased more than tenfold because the competing threads were burning the memory bandwidth.
The naive mitigation is to insert nano-sleeps between spin cycles, in order to at least reduce the performance burnt on the locks themselves. But that's pretty bad as well, as the cores still remain occupied without doing any real work.
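A sketch of that variant, assuming the same atomic flag as above (the 100ns figure is arbitrary, and the actual sleep will be rounded up to whatever resolution the OS timer provides):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Spinlock acquisition with a short sleep between probes. Sleeping throttles
// the atomic traffic, but the waiting thread still makes no forward progress,
// and the real sleep granularity is at the mercy of the OS timer -- typically
// far coarser than nanoseconds.
void lock_with_nanosleep(std::atomic<bool>& locked) {
    while (locked.exchange(true, std::memory_order_acquire)) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(100));
    }
}
```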
Try yielding in the spin loop instead? That just turns out even slower, and you end up with a minimum delay proportional to the scheduling granularity of the operating system. At a granularity of 1ms (Windows realtime mode, active when any process requests it), 5ms (Linux default 200Hz scheduler) or 10ms (Windows default mode), that's a huge delay being introduced into execution. And if you then find the critical section still contended, the yield was wasteful: you added the overhead of a context switch without any gain.
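The yield variant looks just as harmless on paper, which is part of the trap; a sketch:

```cpp
#include <atomic>
#include <thread>

// Spinlock acquisition that yields on contention: the waiter gives up its time
// slice, so with other runnable threads around, the earliest it re-checks the
// lock is whenever the scheduler runs it again -- and each retry adds the cost
// of a context switch.
void lock_with_yield(std::atomic<bool>& locked) {
    while (locked.exchange(true, std::memory_order_acquire)) {
        std::this_thread::yield();
    }
}
```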
Ultimately, use operating system primitives for critical sections. The common approach is to use atomic operations to probe whether any contention has occurred, and only when it has to involve the operating system.
Either way, the operating system underneath has better means to resolve the contention, mostly in the form of wait lists: threads waiting on a semaphore wake up exactly when they are allowed to resume, and are then guaranteed to hold the corresponding lock. When leaving the contended region, the thread owning the lock checks via lightweight means whether there has been any contention, and only if so notifies the OS to resume the other threads.
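A rough sketch of that fast-path/slow-path pattern, written with C++20's std::atomic wait/notify (typically backed by futex on Linux and WaitOnAddress on Windows). This is the classic three-state futex-style mutex, not the exact implementation of any particular OS primitive:

```cpp
#include <atomic>

// Hybrid lock sketch: 0 = free, 1 = locked, 2 = locked with (possible) waiters.
// Requires C++20 for std::atomic::wait/notify_one.
class HybridMutex {
    std::atomic<int> state_{0};
public:
    void lock() {
        int expected = 0;
        // Fast path: an uncontended acquire is a single atomic CAS, no OS call.
        if (state_.compare_exchange_strong(expected, 1, std::memory_order_acquire))
            return;
        // Slow path: mark the lock as contended and let the OS park us on its
        // wait list until the owner wakes us up.
        while (state_.exchange(2, std::memory_order_acquire) != 0)
            state_.wait(2);  // blocks only while the value is still 2
    }
    void unlock() {
        // Only pay for the OS wake-up if someone actually signalled contention.
        if (state_.exchange(0, std::memory_order_release) == 2)
            state_.notify_one();
    }
};
```

The important property is that the uncontended case never touches the kernel, while a contended waiter sleeps on a real wait list instead of burning a core.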
Not that you should actually reinvent the wheel though...
In Windows, that's already how Slim Reader/Writer Locks are implemented.
If you use a plain std::mutex or the like, you will usually already end up with such a mechanism under the hood.
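For example, something as plain as this already gets you the atomic fast path plus an OS-managed wait list, courtesy of the standard library and the OS underneath (the names here are just for illustration):

```cpp
#include <mutex>

std::mutex m;
int shared_counter = 0;

void increment() {
    // Typically: uncontended acquisition is a cheap atomic operation; under
    // contention the thread is parked by the OS and woken when it can proceed.
    std::lock_guard<std::mutex> guard(m);
    ++shared_counter;
}
```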
"Old" literature (10-15 years) will still warn you not to use OS primitives for scheduling, but that's seriously outdated and does not reflect the improvements made on the OS side. What used to be 10ms+ delay for every context switch is essentially down to being barely measurable nowadays.