If this code got no performance benefit at all from running multiple threads on multiple cores, I would not be scratching my head. But how can it actually run slower?
Look at the code first:
class ThreadSafeStack
  def initialize
    @s, @m = [], Mutex.new
  end

  def push(value)
    @m.synchronize { @s.push(value) }
  end

  def pop
    @m.synchronize { @s.pop }
  end

  def peek
    @m.synchronize { @s.last }
  end
end
The full benchmarking script is at https://github.com/alexdowad/showcase/blob/master/ruby-threads/concurrent_stack.rb. Basically, I do a million pushes, a million peeks, and a million pops, divided between 1, 5, or 25 threads (running in parallel).
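For reference, the shape of that driver is roughly the following (a hedged sketch, not the actual script; the class is repeated so the snippet runs on its own, and the thread/iteration counts are parameters):

```ruby
require 'benchmark'

# The stack from the question, repeated so this snippet is self-contained.
class ThreadSafeStack
  def initialize
    @s, @m = [], Mutex.new
  end

  def push(value)
    @m.synchronize { @s.push(value) }
  end

  def pop
    @m.synchronize { @s.pop }
  end

  def peek
    @m.synchronize { @s.last }
  end
end

# Split `total` pushes, peeks, and pops evenly across `nthreads` threads.
def bench_stack(nthreads, total)
  per_thread = total / nthreads
  stack = ThreadSafeStack.new
  puts "Testing ThreadSafeStack with #{nthreads} threads, iterating #{per_thread}x each"
  t = Benchmark.measure do
    nthreads.times.map do
      Thread.new do
        per_thread.times { stack.push(1) }
        per_thread.times { stack.peek }
        per_thread.times { stack.pop }
      end
    end.each(&:join)
  end
  puts t
  t
end

bench_stack(1, 100_000)
bench_stack(5, 100_000)
```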
Results from a 4-core Mac Pro, running JRuby 1.6.5.1:
Testing ThreadSafeStack with 1 thread, iterating 1000000x each
1.575000 0.000000 1.575000 ( 1.575000)
Testing ThreadSafeStack with 5 threads, iterating 200000x each
4.838000 0.000000 4.838000 ( 4.838000)
Testing ThreadSafeStack with 25 threads, iterating 40000x each
11.409000 0.000000 11.409000 ( 11.409000)
What gives???
EDIT: One more piece of information which might be relevant -- this benchmark does run faster with multiple threads when I use a lockless stack (implemented with compare-and-swap operations).
Because ... you're synchronizing?
Only one thread can do any one of those things at any time ... so you can't go faster than a single thread, and you've added the overhead of locks.
Edit, to add from the comments below, because it's worth spelling out:
Locking is expensive. You now have X threads contending for the same resource. I'm not familiar enough with Ruby's internals to tell you exactly how it's implemented, but at least on *nix it should be a fairly straightforward path down to a pthread_mutex. An uncontended lock can be handled entirely in user space, but a contended lock requires a call into the kernel, and that's expensive; it's why this gets so much slower. On top of that, every time a thread yields to wait on the lock, you're more than likely paying for a context switch as well, which is also expensive.
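You can see the uncontended-vs-contended gap directly with a toy measurement like this (a rough sketch using only the standard library; exact numbers will vary by Ruby implementation and machine):

```ruby
require 'benchmark'

N = 400_000
mutex = Mutex.new
counter = 0

# Uncontended: one thread acquires and releases the lock repeatedly.
# This path can typically stay in user space (e.g. a futex fast path).
uncontended = Benchmark.realtime do
  N.times { mutex.synchronize { counter += 1 } }
end

# Contended: four threads fight over the same lock, so threads block,
# the kernel gets involved, and context switches pile up.
counter = 0
contended = Benchmark.realtime do
  4.times.map do
    Thread.new { (N / 4).times { mutex.synchronize { counter += 1 } } }
  end.each(&:join)
end

puts format("uncontended: %.3fs  contended: %.3fs", uncontended, contended)
```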