I'm facing a scalability issue on a multicore system. My application processes scientific data in parallel on a machine with 4 physical cores (8 logical cores with hyper-threading enabled). We launch 8 JVMs, one per logical core (we'll probably switch to a single JVM eventually to avoid the per-JVM overhead).
The issue is that the scalability is nearly linear up to 4 cores, but then we barely gain 10-20% performance by adding 4 more "logical cores".
I analysed thread behaviour by profiling the app and I see no locks, and no threads that spend much time waiting. I also checked with pidstat and I don't see, for instance, excessive context-switch overhead. More precisely, there is almost no context switching in the Java processes. CPU usage is very high, reaching almost 100%, which also seems fine.
My question is how to detect and analyse the cause of this poor scalability once the number of processes exceeds the number of physical cores. Which tools and methods can I use to find where the contention is, where should I look, and can I fix it somehow without changing the architecture of the application too much (for instance, by switching to one JVM per machine)?
Thanks
Please be mindful that hyper-threading does not double the capacity of a single core. In fact, there are tasks which perform worse when hyper-threading is on.
The gain depends heavily on the nature of the work: the more pipeline stalls there are, the more opportunity the core has to run the other hardware thread in place of the stalled one.
As an example: totally random access to memory would gain more from hyper-threading than a very fast, CPU-intensive computation that stays within the same cache line.
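One way to check which of the two cases your workload resembles is to time a compute-bound loop and a random-memory-access loop at each thread count and see where each one stops scaling. The following is only a rough sketch: the class SmtScalingProbe, both kernels, the array size and the iteration counts are made up for illustration, and the timing is naive (no JIT warm-up, no thread pinning), so a proper harness such as JMH would give more trustworthy numbers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadLocalRandom;

class SmtScalingProbe {
    static volatile long sink; // consumes kernel results so the JIT cannot discard the loops

    // Hypothetical compute-bound kernel: a dependent arithmetic chain held in registers.
    static long computeKernel(long iterations) {
        long x = 1;
        for (long i = 0; i < iterations; i++) {
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        return x;
    }

    // Hypothetical memory-bound kernel: pointer chasing, each load depends on the previous one.
    static long memoryKernel(int[] next, long iterations) {
        int p = 0;
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            p = next[p];
            sum += p;
        }
        return sum;
    }

    // Builds one random cycle over all slots (Sattolo's algorithm).
    static int[] randomCycle(int size) {
        int[] next = new int[size];
        for (int i = 0; i < size; i++) next[i] = i;
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = size - 1; i > 0; i--) {
            int j = rnd.nextInt(i); // 0 <= j < i
            int t = next[i]; next[i] = next[j]; next[j] = t;
        }
        return next;
    }

    // Runs `threads` copies of the task concurrently and returns the elapsed nanoseconds.
    static long timeWithThreads(int threads, Runnable task) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try { start.await(); } catch (InterruptedException e) { return; }
                task.run();
            });
            workers[i].start();
        }
        long t0 = System.nanoTime();
        start.countDown(); // release all workers at once
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        long computeIters = 500_000_000L; // arbitrary, chosen only to run long enough to time
        long memoryIters = 20_000_000L;   // arbitrary
        int[] next = randomCycle(1 << 24); // 64 MiB of ints, assumed larger than the last-level cache
        for (int n = 1; n <= logical; n++) {
            long tc = timeWithThreads(n, () -> sink = computeKernel(computeIters));
            long tm = timeWithThreads(n, () -> sink = memoryKernel(next, memoryIters));
            System.out.printf("threads=%d compute=%.1f Mops/s memory=%.1f Mops/s%n",
                    n, n * computeIters * 1e3 / tc, n * memoryIters * 1e3 / tm);
        }
    }
}
```

If the memory-bound throughput keeps rising past the physical core count while the compute-bound throughput flattens, that would be consistent with the explanation above. Replacing the kernels with a slice of your real workload is the more useful experiment.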
Here are the things that two hardware threads share; contention on any of them will limit the gains:
Another observation is that the operating system has to support SMT/HT; otherwise it will not be able to schedule anything onto the additional logical cores, or it will schedule the wrong tasks.
Even when the OS supports it, there is still a chance of contention inside the OS on things like file handles or network sockets. The more 'embarrassingly parallel' the work is, the more opportunity you have to limit this contention. If, however, your work involves reading and/or writing to the same system resource, you will see smaller gains.
Once you have brought all of these tasks into one JVM, your level of parallelism is going to be the number of logical processors the JVM reports:
int cores = Runtime.getRuntime().availableProcessors();
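As a minimal sketch of how that value could be used inside a single JVM, the example below sizes a fixed thread pool from it and splits a range of work into one chunk per worker. The class PoolSizing and the sumSquares workload are hypothetical stand-ins for your real processing, and using one worker per logical core is only a starting point to measure against, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSizing {
    // Hypothetical stand-in for real work: sums i*i over [from, to).
    static long sumSquares(long from, long to) {
        long s = 0;
        for (long i = from; i < to; i++) s += i * i;
        return s;
    }

    // Splits [0, n) into one chunk per worker and adds up the partial results.
    static long parallelSumSquares(long n, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            long chunk = (n + workers - 1) / workers; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final long from = Math.min(n, w * chunk);
                final long to = Math.min(n, from + chunk);
                Callable<Long> task = () -> sumSquares(from, to);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) total += f.get();
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Counts logical processors, so 8 on a 4-core machine with hyper-threading enabled.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("logical cores: " + cores);
        System.out.println("result: " + parallelSumSquares(1_000_000L, cores));
    }
}
```

Passing the worker count as a parameter makes it easy to rerun the same job with 4 and with 8 workers and compare the timings directly.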