JRE 23 runs much faster than previous versions

I am using the following code to check performance. It is purely CPU-bound and single-threaded, does a lot of calculations on double values, and does not allocate anything on the heap:

public class PerfTestSampleJ {
    private static final int MEASURE_COUNT = 5;
    private static final int ITERATIONS = 100_000_000;

    public static void main(String[] args) {
        var minTime = Long.MAX_VALUE;
        for (int i = 1; i <= MEASURE_COUNT; i++) {
            long start = System.nanoTime();
            double pi = calculatePi(ITERATIONS);
            long time = System.nanoTime() - start;
            System.out.printf("Iteration %2d took %8.3f ms%n", i, time / 1e6);
            if (time < minTime) {
                minTime = time;
            }
            if (Math.abs(pi - Math.PI) > 1e-14)
                throw new AssertionError(pi + " (" + (pi - Math.PI) + ")");
        }
        System.out.printf("Minimum time taken: %8.3f ms%n", minTime / 1e6);
    }

    private static double calculatePi(int iterations) {
        double pi = 0.0;
        double numerator = 4.0;
        for (int i = 1; i <= iterations; i++) {
            double n = i * 2.0;
            double denominator = n * (n + 1) * (n + 2);
            pi += numerator / denominator;
            numerator = -numerator;
        }
        return 3 + pi;
    }
}

Now, using the same compiled class file, compare the results when running under JRE 21 versus JRE 23:

/usr/lib/jvm/jdk-21.0.5-oracle-x64/bin/java PerfTestSampleJ
Iteration  1 took  801.058 ms
Iteration  2 took  798.392 ms
Iteration  3 took  414.688 ms
Iteration  4 took  413.959 ms
Iteration  5 took  416.867 ms
Minimum time taken:  413.959 ms

/usr/lib/jvm/jdk-23.0.1-oracle-x64/bin/java PerfTestSampleJ
Iteration  1 took  193.654 ms
Iteration  2 took  186.790 ms
Iteration  3 took  102.963 ms
Iteration  4 took  103.226 ms
Iteration  5 took  102.869 ms
Minimum time taken:  102.869 ms

In each run, there is a warmup phase in the first 2 iterations, but iteration 3 onwards is about as fast as it ever gets.

What changed in Java 23 to make this faster? The only performance-related items I can find in the release notes are garbage collector improvements, but this code does not use the heap, so those are irrelevant.

P.S. The above results are from Ubuntu Linux x64 on an i7 processor. I get the same results with Temurin builds. I also tried Oracle JRE 22 vs 23 on Windows x64 with similar results, which shows that the performance change happened between 22 and 23.

Solution

A similar effect (JDK 23 being much faster than JDK 21) can be observed on a simplified JMH benchmark:

@Benchmark
public double compute() {
    double d = 0.0;
    for (int i = 1; i <= ITERATIONS; i++) {
        d += 1.0 / i;
    }
    return d;
}

To find out the reason, we will run the benchmark with the -prof perfasm profiler and analyze the generated code. The compiled loop is unrolled 16 times, but for our purposes it is enough to look at the first two iterations:

JDK 21

0x00007f822049c323:   lea    0xf(%r8),%r10d
0x00007f822049c327:   lea    0xe(%r8),%r11d
0x00007f822049c32b:   vcvtsi2sd %r10d,%xmm0,%xmm0
0x00007f822049c330:   vdivsd %xmm0,%xmm2,%xmm3
0x00007f822049c334:   vcvtsi2sd %r11d,%xmm0,%xmm0
0x00007f822049c339:   vdivsd %xmm0,%xmm2,%xmm4

JDK 23

0x00007f46f442e1f3:   lea    0xf(%r8),%r10d
0x00007f46f442e1f7:   lea    0xe(%r8),%r11d
0x00007f46f442e1fb:   vpxor  %xmm0,%xmm0,%xmm0       (!)
0x00007f46f442e1ff:   vcvtsi2sd %r10d,%xmm0,%xmm0
0x00007f46f442e204:   vdivsd %xmm0,%xmm2,%xmm3
0x00007f46f442e208:   vpxor  %xmm0,%xmm0,%xmm0       (!)
0x00007f46f442e20c:   vcvtsi2sd %r11d,%xmm0,%xmm0
0x00007f46f442e211:   vdivsd %xmm0,%xmm2,%xmm4

The code is pretty much the same, except that the JDK 23 version contains two extra vpxor instructions. How can extra instructions result in faster execution?

The clue is the AVX vcvtsi2sd instruction, which converts an integer to a double. It has two source operands: a general-purpose register holding the integer, and a SIMD register from which bits 64-127 are copied into the destination. This creates a false dependency on the source SIMD register, even though the subsequent code never uses the upper bits. In the listings above, that register is xmm0, which the previous conversion has just written, so each conversion has to wait for the one before it.

XORing a register with itself is a cheap way to zero it, including its upper bits. This breaks the dependency: the hardware recognizes that the result of the XOR does not depend on the register's previous contents, so vcvtsi2sd and the subsequent vdivsd no longer have to wait for bits 64-127, which will always be zero.

This was a performance regression, JDK-8318562, which was fixed in JDK 23 by this PR. You can find further explanation in the comments on the PR.

Interestingly enough, disabling AVX instructions with -XX:UseAVX=0 improves benchmark performance on JDK 21 and earlier.
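
If you want to try this without setting up JMH, here is a minimal standalone sketch of the same simplified loop. The class name HarmonicLoopTiming and the ITERATIONS and ROUNDS values are my own choices for illustration, not part of the original benchmark. Hand-rolled timing like this is less reliable than JMH, so treat the numbers as a rough indication only.

```java
class HarmonicLoopTiming {
    // Illustrative values (assumptions), not taken from the JMH benchmark above.
    private static final int ITERATIONS = 100_000_000;
    private static final int ROUNDS = 5;

    // Same shape as the simplified benchmark: every iteration converts
    // an int to a double and then performs a division.
    static double compute(int iterations) {
        double d = 0.0;
        for (int i = 1; i <= iterations; i++) {
            d += 1.0 / i;
        }
        return d;
    }

    public static void main(String[] args) {
        // Print the runtime version so output from different JDKs is easy to tell apart.
        System.out.println("Runtime: " + Runtime.version());

        long best = Long.MAX_VALUE;
        double result = 0.0;
        for (int round = 1; round <= ROUNDS; round++) {
            long start = System.nanoTime();
            result = compute(ITERATIONS);
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
            System.out.printf("Round %d took %8.3f ms%n", round, elapsed / 1e6);
        }

        // Printing the result makes sure the loop's value is actually used.
        System.out.printf("Result: %.12f, best time: %8.3f ms%n", result, best / 1e6);
    }
}
```

Run it with each JDK's java binary (for example, java HarmonicLoopTiming.java), and on JDK 21 also with -XX:UseAVX=0 added, then compare the later rounds, since the first ones include JIT warmup.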