How can I evaluate performances of a lockless queue?

I have implemented a lockless queue using the hazard pointer methodology explained in http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf using GCC CAS instructions for the implementation and pthread local storage for thread local structures. I'm now trying to evaluate the performance of the code I have written, in particular I'm trying to do a comparison between this implementation and the one that uses locks (pthread mutexes) to protect the queue.
I'm asking this question here because I tried comparing it with the "locked" queue and I found that this has better performances with respect to the lockless implementation. The only test I tried is creating 4 thread on a 4-core x86_64 machine doing 10.000.000 random operations on the queue and it it significantly faster than the lockless version.

I want to know if you can suggest me an approach to follow, i.e. what kind of operation I have to test on the queue and what kind of tool I can use to see where my lockless code is wasting its time.

I also want to understand if it is possible that the performance are worse for the lockless queue just because 4 threads are not enough to see a major improvement...

Thanks

Solution

First point: lock-free programming doesn't necessarily improve speed. Lock-free programming (when done correctly) guarantees forward progress. When you use locks, it's possible for one thread to crash (e.g., go into an infinite loop) while holding a mutex. When/if that happens, no other thread waiting on that mutex can make any more progress. If that mutex is central to normal operation, you may easily have to restart the entire process before any more work can be done at all. With lock-free programming, no such circumstance can arise. Other threads can make forward progress, regardless of what happens in any one thread¹.

That said, yes, one of the things you hope for is often better performance -- but to see it, you'll probably need more than four threads. Somewhere in the range of dozens to hundreds of threads would give your lock-free code a much better chance of showing improved performance over a lock-based queue. To really do a lot of good, however, you not only need more threads, but more cores as well -- at least based on what I've seen so far, with four cores and well-written code, there's unlikely to be enough contention over a lock for lock-free programming to show much (if any) performance benefit.

Bottom line: More threads (at least a couple dozen) will improve the chances of the lock-free queue showing a performance benefit, but with only four cores, it won't be terribly surprising if the lock-based queue still keeps up. If you add enough threads and cores, it becomes almost inevitable that the lock-free version will win. The exact number of threads and cores necessary is hard to predict, but you should be thinking in terms of dozens at a minimum.

¹ At least with respect to something like a mutex. Something like a fork-bomb that just ate all the system resources might be able to deprive the other threads of enough resources to get anything done -- but some care with things like quotas can usually prevent that as well.