ruby multithreading performance stack multicore

Why does this code run slower with multiple threads, even on a multi-core machine?

If this code got no performance benefit at all from running multiple threads on multiple cores, I would not be scratching my head. But how can it actually run slower?

Look at the code first:

class ThreadSafeStack
  def initialize
    @s,@m = [],Mutex.new
  end
  def push(value)
    @m.synchronize { @s.push(value) }
  end
  def pop
    @m.synchronize { @s.pop }
  end
  def peek
    @m.synchronize { @s.last }
  end
end

The full benchmarking script is at https://github.com/alexdowad/showcase/blob/master/ruby-threads/concurrent_stack.rb. Basically, I do a million pushes, a million peeks, and a million pops, divided between 1, 5, or 25 threads (running in parallel).

Results from a 4-core Mac Pro, running JRuby 1.6.5.1:

Testing ThreadSafeStack with 1 thread, iterating 1000000x each
  1.575000   0.000000   1.575000 (  1.575000)
Testing ThreadSafeStack with 5 threads, iterating 200000x each
  4.838000   0.000000   4.838000 (  4.838000)
Testing ThreadSafeStack with 25 threads, iterating 40000x each
 11.409000   0.000000  11.409000 ( 11.409000)

What gives???

EDIT: One more piece of information which might be relevant -- this benchmark does run faster with multiple threads when I use a lockless stack (implemented with compare-and-swap operations).

Solution

Because ... you're synchronizing?

Only one thread can do any one of those things at any time ... so you can't go faster than a single thread, and you've added the overhead of locks.

Edit to add from comments below because it's .. worth adding:

Locking is expensive. You have X number of threads now contending for the same resource. I'm not familiar with ruby internals to tell you exactly how they implement it but at least on *nix it should be a fairly straightforward path to a pthread_mutex. An uncontended lock can be handled in user space but a contended lock requires a call into the kernel; that's expensive and why it's that much slower, not to mention that every time a thread yields to wait on the lock it's more than likely that you're doing a context switch which is also expensive.

Why does this code run *slower* with multiple threads, even on a multi-core machine?

Why does this code run slower with multiple threads, even on a multi-core machine?