java, performance, jvm, low-latency, hft

Why does JVM performance improve with more load?


We are seeing behavior where the performance of the JVM degrades when the load is light. Specifically, across multiple runs in a test environment, we notice that latency worsens by around 100% when the rate of order messages pumped into the system is reduced. Some background on the issue is below and I would appreciate any help with this.

Simplistically, the demo Java trading application being investigated can be thought of as having three important threads: an order receiver thread, a processor thread, and an exchange transmitter thread.

The order receiver thread receives the order and puts it on a processor queue. The processor thread picks it up from the processor queue, does some basic processing, and puts it on the exchange queue. The exchange transmitter thread picks it up from the exchange queue and sends the order to the exchange.
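For reference, a simplified sketch of that pipeline is below (the order type, queue sizes, and rates are illustrative placeholders, not the actual implementation):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineSketch {
        record Order(long id, long receivedNanos) {}   // placeholder order type

        public static void main(String[] args) {
            BlockingQueue<Order> processorQ = new ArrayBlockingQueue<>(1024);
            BlockingQueue<Order> exchangeQ  = new ArrayBlockingQueue<>(1024);

            // Order receiver thread: receives orders and puts them on the processor queue.
            new Thread(() -> {
                try {
                    for (long id = 0; ; id++) {
                        processorQ.put(new Order(id, System.nanoTime()));
                        Thread.sleep(100);              // placeholder inbound rate
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }, "order-receiver").start();

            // Processor thread: basic processing, then hand off to the exchange queue.
            new Thread(() -> {
                try {
                    while (true) {
                        Order o = processorQ.take();
                        exchangeQ.put(o);               // real processing omitted
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }, "processor").start();

            // Exchange transmitter thread: sends the order out and records latency.
            new Thread(() -> {
                try {
                    while (true) {
                        Order o = exchangeQ.take();
                        long latencyUs = (System.nanoTime() - o.receivedNanos()) / 1_000;
                        System.out.println("order " + o.id() + " out, latency " + latencyUs + " us");
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }, "exchange-transmitter").start();
        }
    }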

The latency from order receipt to the order going out to the exchange worsens by around 100% when the rate of orders pumped into the system is changed from a high number to a low number.

Solutions tried:

  1. Warming up the critical code path in the JVM by sending a high message rate to prime the system before reducing the message rate (roughly the approach sketched after this list): does not solve the issue.

  2. Profiling the application: a profiler shows hotspots where a 10-15% improvement might be gained by improving the implementation, but nothing in the range of the 100% improvement obtained simply by increasing the message rate.
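For illustration, the warm-up driver tried in (1) looked roughly like this sketch (the OrderPipeline interface and iteration count are hypothetical placeholders, not the actual harness):

    public class WarmupSketch {
        interface OrderPipeline { void submit(long orderId); }   // hypothetical entry point

        static void warmUp(OrderPipeline pipeline) throws InterruptedException {
            // Push a high-rate burst of synthetic orders through the critical path so
            // the JIT compiles it before the live rate drops to a low level.
            for (long id = 0; id < 200_000; id++) {
                pipeline.submit(id);
            }
            Thread.sleep(2_000);   // allow the compiler threads time to finish
        }

        public static void main(String[] args) throws InterruptedException {
            warmUp(orderId -> { /* stand-in for the real pipeline */ });
            System.out.println("warm-up done; switching to the low live rate");
        }
    }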

Does anyone have any insights/suggestions on this? Could it have to do with scheduling jitter on the threads?

Could it be that, under the low message rate, the threads are being switched off their cores?

Two posts that I think may be related are below; however, our symptoms are a bit different:

is the jvm faster under load?

Why does the JVM require warmup?


Solution

  • Consistent latency under low/medium load requires specific tuning of Linux.

    Below are a few points from my old checklist that are relevant for components with millisecond latency requirements (example commands follow the list).

    • configure CPU cores to always run at maximum frequency (here are docs for RedHat)
    • configure dedicated CPU cores for your critical application threads
      • use isolcpus to exclude the dedicated cores from the scheduler
      • use taskset to bind a critical thread to a specific core
    • configure your service to run on a single NUMA node (with numactl)
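    For illustration, the commands involved look roughly like this (core numbers, the NUMA node, thread/process ids, and the jar name are placeholders; consult your distribution's documentation before applying any of this):

        # pin the CPU frequency governor to maximum performance
        cpupower frequency-set --governor performance

        # reserve cores 2-3 for the application: add isolcpus=2,3 to the kernel
        # boot parameters (e.g. GRUB_CMDLINE_LINUX) and reboot

        # start the JVM bound to an isolated core (core 2 is a placeholder)
        taskset -c 2 java -jar trading-app.jar

        # or re-pin an already running critical thread by its native thread id
        taskset -cp 2 <tid>

        # bind CPU and memory to a single NUMA node (node 0 is a placeholder)
        numactl --cpunodebind=0 --membind=0 java -jar trading-app.jar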

    The Linux scheduler and CPU power management are key contributors to high variance in latency under low/medium load.

    By default, a CPU core reduces its frequency when inactive; as a consequence, your next request is processed more slowly on a downclocked core.

    The CPU cache is a key performance asset: if your critical thread is scheduled on different cores, it loses its cached data. In addition, other threads scheduled on the same core evict cache entries, further increasing the latency of the critical code.

    Under heavy load these factors are less important (frequency is maxed out and threads are ~100% busy, tending to stick to specific cores).

    Under low/medium load, however, these factors negatively affect both average latency and high percentiles (the 99th percentile may be an order of magnitude worse than in the heavy-load case).

    For high-throughput applications (above 100k requests/sec), advanced inter-thread communication approaches (e.g. the LMAX Disruptor) are also useful.
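    As a rough plain-Java illustration of that idea, below is a minimal single-producer/single-consumer busy-spin ring buffer (the real Disruptor adds cache-line padding, batching, and much more; the class and method names here are made up for the sketch):

        import java.util.concurrent.atomic.AtomicLong;

        // Minimal single-producer/single-consumer busy-spin ring buffer, shown only
        // to illustrate the lock-free handoff style that the LMAX Disruptor refines.
        final class SpscRingBuffer<T> {
            private final Object[] buffer;
            private final int mask;
            private final AtomicLong head = new AtomicLong();   // next slot to read
            private final AtomicLong tail = new AtomicLong();   // next slot to write

            SpscRingBuffer(int capacityPowerOfTwo) {            // capacity must be a power of two
                buffer = new Object[capacityPowerOfTwo];
                mask = capacityPowerOfTwo - 1;
            }

            void publish(T item) {                              // single producer thread only
                long t = tail.get();
                while (t - head.get() >= buffer.length) {
                    Thread.onSpinWait();                        // buffer full: busy-spin
                }
                buffer[(int) (t & mask)] = item;
                tail.lazySet(t + 1);                            // release: make the slot visible
            }

            @SuppressWarnings("unchecked")
            T take() {                                          // single consumer thread only
                long h = head.get();
                while (tail.get() == h) {
                    Thread.onSpinWait();                        // buffer empty: busy-spin
                }
                T item = (T) buffer[(int) (h & mask)];
                head.lazySet(h + 1);
                return item;
            }
        }

    A pair of these (receiver to processor, processor to exchange), each with exactly one writer and one reader thread, could replace the blocking queues; note that busy-spinning only pays off when the consuming threads are pinned to dedicated cores as described above.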