This research started from a Data Structures and Algorithms lecture. I apologize for the long background, but it is necessary for understanding the question.
The partitioning algorithms Lomuto and Hoare were benchmarked to see whether their running times correspond to what the theory predicts. The analysis shows that, for an input array of length n, they perform approximately the following number of swaps:

Lomuto: n/2
Hoare:  n/4, if big and small numbers are equally distributed in the array; n/6, if big and small numbers are not equally distributed in the array

Swapping is the most expensive operation in these algorithms, so the swap count is a good estimate of how they compare in terms of running time. This means that if the array is random, with big and small numbers equally distributed, and Hoare has a running time of 20 µs, then Lomuto should take approximately 40 µs, i.e. about 2 times slower.
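For context, the partition implementations themselves are not shown below. A standard textbook sketch of the two schemes, compatible with the Function<int[], Integer> method references used by the benchmark (Partitioning::partitionLomuto and Partitioning::partitionHoare), might look roughly like this; the actual benchmarked code may differ in details such as pivot choice:

// A hypothetical sketch, not necessarily the exact class that was benchmarked.
public class Partitioning {

    // Lomuto: single scan, pivot = last element; roughly n/2 swaps on random input.
    public static int partitionLomuto(int[] A) {
        int pivot = A[A.length - 1];
        int i = -1;
        for (int j = 0; j < A.length - 1; j++) {
            if (A[j] <= pivot) {
                i++;
                swap(A, i, j);
            }
        }
        swap(A, i + 1, A.length - 1);
        return i + 1; // final position of the pivot
    }

    // Hoare: two indices moving towards each other; fewer swaps on average.
    public static int partitionHoare(int[] A) {
        int pivot = A[0];
        int i = -1, j = A.length;
        while (true) {
            do { i++; } while (A[i] < pivot);
            do { j--; } while (A[j] > pivot);
            if (i >= j)
                return j; // split point (not necessarily the pivot's final position)
            swap(A, i, j);
        }
    }

    private static void swap(int[] A, int i, int j) {
        int tmp = A[i];
        A[i] = A[j];
        A[j] = tmp;
    }
}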
The microbenchmark:
public static double averageRuntimeTests(Function<int[], Integer> algorithm, int size,
        int warmupIterations, int measureIterations, int repeatedExecs) {
    long start, end;
    int result = -1;
    double[] runningTimes = new double[measureIterations];
    // Warmup
    for (int i = 0; i < warmupIterations; i++) {
        int[] A = randomArray(size);
        for (int j = 0; j < repeatedExecs; j++)
            result = algorithm.apply(A);
        System.out.print(result); // use the result so the calls cannot be eliminated as dead code
    }
    // Measure
    for (int i = 0; i < measureIterations; i++) {
        int[] A = randomArray(size);
        start = System.nanoTime();
        for (int j = 0; j < repeatedExecs; j++)
            result = algorithm.apply(A);
        end = System.nanoTime();
        System.out.print(result);
        runningTimes[i] = (end - start) / (repeatedExecs * 1000.0); // nanoseconds -> microseconds per execution
    }
    return average(runningTimes);
}
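The helpers randomArray and average are not shown in the question; plausible definitions (my assumption, using only the names referenced above) would be something like:

// Plausible helper implementations (not part of the original question).
static int[] randomArray(int size) {
    java.util.Random rnd = new java.util.Random();
    int[] A = new int[size];
    for (int i = 0; i < size; i++)
        A[i] = rnd.nextInt(); // uniformly distributed, so big and small values are equally likely
    return A;
}

static double average(double[] values) {
    double sum = 0;
    for (double v : values)
        sum += v;
    return sum / values.length;
}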
The microbenchmark agrees with the theory if the algorithms are run sufficiently many times on the same input array, so that the JIT compiler "has enough time" to optimize the code. See the following code, where the algorithms are run 30 times on each different array.
public static void main(String[] args) {
    int size = 10_000, warmupIterations = 10_000, measureIterations = 2_000, repeatedExecs = 30;
    System.out.printf("input size = %d%n", size);
    System.out.printf("#warmup iterations = %d%n", warmupIterations);
    System.out.printf("#measure iterations = %d%n", measureIterations);
    System.out.printf("#repeated executions = %d%n", repeatedExecs);
    double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
            measureIterations, repeatedExecs);
    double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
            measureIterations, repeatedExecs);
    System.out.printf("%nHoare: %f us/op%n", timeHoare);
    System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}
Result:
Hoare:  3.7685 us/op
Lomuto: 7.94 us/op
Lomuto is approximately 2 times slower than Hoare, as expected for a uniformly distributed array.
Result when Lomuto runs before Hoare:
Hoare:  3.5865 us/op
Lomuto: 13.513 us/op
Lomuto is now about 4 times slower than Hoare, more than it should be. For some reason, it takes almost twice as long when it runs before Hoare.
However, if the algorithms are run only once on each different input array, the JIT compiler behaves unexpectedly.
public static void main(String[] args) {
    int size = 10_000, warmupIterations = 300_000, measureIterations = 60_000, repeatedExecs = 1;
    System.out.printf("input size = %d%n", size);
    System.out.printf("#warmup iterations = %d%n", warmupIterations);
    System.out.printf("#measure iterations = %d%n", measureIterations);
    System.out.printf("#repeated executions = %d%n", repeatedExecs);
    double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
            measureIterations, repeatedExecs);
    double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
            measureIterations, repeatedExecs);
    System.out.printf("%nHoare: %f us/op%n", timeHoare);
    System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}
Result (whether Hoare runs before Lomuto or not):
Hoare:  31.8233 us/op
Lomuto: 26.676133 us/op
It is shocking to see that Lomuto is now actually faster than Hoare! What is the JIT compiler doing here?
I keep blaming the JIT compiler because, if I disable it completely and run in interpreter-only mode (using the -Djava.compiler=NONE flag), the benchmark behaves as expected again. Running the algorithms a single time on each different array...
Result (whether Hoare runs before Lomuto or not):
Hoare:  254.0455 us/op
Lomuto: 597.76 us/op
As you can see, Lomuto is again approximately 2 times slower than Hoare, as expected.
Could someone please explain what is going on with the JIT compiler in case 2? It looks like the JIT compiler only partially optimizes the code. But then, why is Lomuto as fast as Hoare? Shouldn't Hoare still be faster?
"How does Java's JIT compiler run bad code faster than good code?"
The JIT compiler doesn't run code. It compiles code.
What you are seeing is not (valid) evidence of bad code running faster than good code.
I know that there is the JMH library to reliably run microbenchmarks in Java. I am just trying to understand the underlying mechanics of the JIT compiler.
It is probably not what the JIT compiler does that matters. It is probably when it does it that is causing the problems.
The JIT compiler uses CPU time to optimize. If you implement a micro-benchmark in the naive way, part or all of the JIT compiler time will be included in one (or more) of the iterations of your benchmark. That will make the benchmark iteration (or iterations) appear to take longer than earlier or later iterations. This distorts the apparent execution time.
When you think about it, your Java benchmark runs at three or more apparent¹ speeds:

- the speed of the interpreted bytecode, before the JIT compiler has compiled anything;
- the speed while the JIT compiler is running, when some of the compilation overhead lands inside your timed iterations;
- the speed of the fully JIT-compiled code.
If you use jmh (or similar), it should compensate for the effects of JIT compilation and other anomalies. If you want more information, please read How do I write a correct micro-benchmark in Java? ... which helps to explain the common pitfalls.
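For a concrete starting point, a minimal JMH sketch of the same comparison could look roughly like this (the benchmark class name is illustrative, and it assumes a Partitioning class like the one sketched earlier); JMH handles warm-up, forking and result consumption itself, so the numbers are far less sensitive to the pitfalls described above:

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)
public class PartitioningBench {
    int[] A;

    // Fresh random array before every invocation, mirroring "one run per array".
    // (JMH documents that per-invocation setup can add noise for very short benchmarks.)
    @Setup(Level.Invocation)
    public void setup() {
        A = new Random().ints(10_000).toArray();
    }

    @Benchmark
    public int hoare() {
        return Partitioning.partitionHoare(A);
    }

    @Benchmark
    public int lomuto() {
        return Partitioning.partitionLomuto(A);
    }
}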
¹ - I am referring to the apparent speed obtained using before-and-after system clock measurements.