The performance (throughput) of summing all elements in the array in for loop is slower on newer JVMs, than on JVM from Java 1.8.0 JDK. I performed JHM benchmark (plots below). Before each test, sources were compiled by provided javac.exe and run by java.exe, both binaries provided by selected JDK. Tests were performed on Windows 10 and launched by powershell script without any programs running in the background (no other jvms). The computer was equipped with 32GB of RAM, so virtual memory on HDD was not used.
Source code of my test:
@Param({"10000000", "100000000"})
public static int ELEMENTS;
public static void main(String[] args) throws RunnerException, IOException {
File outputFile = new File(args[0]);
int javaMajorVersion = Integer.parseInt(System.getProperty("java.version").split("\\.")[0]);
ChainedOptionsBuilder builder = new OptionsBuilder()
.include(IteratingBenchmark.class.getSimpleName())
.mode(Mode.Throughput)
.forks(2)
.measurementTime(TimeValue.seconds(10))
.measurementIterations(50)
.warmupTime(TimeValue.seconds(2))
.warmupIterations(10)
.resultFormat(ResultFormatType.SCSV)
.result(outputFile.getAbsolutePath());
if (javaMajorVersion > 8) {
builder = builder.jvmArgs("-Xms20g", "-Xmx20g", "--enable-preview");
} else {
builder = builder.jvmArgs("-Xms20g", "-Xmx20g");
}
new Runner(builder.build()).run();
}
@Benchmark
public static void cStyleForLoop(Blackhole bh, MockData data) {
long sum = 0;
for (int i = 0; i < data.randomInts.length; i++) {
sum += data.randomInts[i];
}
bh.consume(sum);
}
@State(Scope.Thread)
public static class MockData {
private int[] randomInts = new int[ELEMENTS];
@Setup(Level.Iteration)
public void setup() {
Random r = new Random();
this.randomInts = Stream.iterate(r.nextInt(), i -> i + r.nextInt(1022) + 1).mapToInt(Integer::intValue).limit(ELEMENTS).toArray();
}
}
Raw data:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;331,446104;5,563589;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;33,757268;0,431403;"ops/s";100000000
JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,728461;4,823611;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,075948;0,062830;"ops/s";100000000
JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;322,914782;4,450969;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,095232;0,075051;"ops/s";100000000
JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;325,103055;4,933257;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;31,228403;0,067954;"ops/s";100000000
JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,861148;0,443404;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,863602;0,035781;"ops/s";100000000
OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;300,781930;0,481579;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;29,873509;0,033055;"ops/s";100000000
OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;343,530895;0,445551;"ops/s";10000000
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;100;34,287083;0,035028;"ops/s";100000000
Is there any valid explanation, why newer versions of Java are slower than 1.8 (except OpenJDK 15)?
UPDATE 1:
I run same tests for different Xmx/Xms values (for each test Xmx == Xms), results below:
UPDATE 2:
Level.Iteration
to Level.Trial
.Raw data:
JDK 1.8.0_241:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;33,760346;0,089646;"ops/s";100000000
JDK 11.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,075120;0,086171;"ops/s";100000000
JDK 12.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,173939;0,044176;"ops/s";100000000
JDK 13.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;31,219283;0,062329;"ops/s";100000000
JDK 14.0.1:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,808609;0,072664;"ops/s";100000000
OpenJDK 14.0.2:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;29,845817;0,074315;"ops/s";100000000
OpenJDK 15:
"Benchmark";"Mode";"Threads";"Samples";"Score";"Score Error (99,9%)";"Unit";"Param: ELEMENTS"
"benchmark.IteratingBenchmark.cStyleForLoop";"thrpt";1;15;34,310620;0,087412;"ops/s";100000000
UPDATE 3:
I made GitHub Repository containing benchmark source code, and script to perform benchmark with JMH parameters used by me, which automatically generates plots in png format.
Additionally I performed benchmark on other machine (Linux).
Results from Linux machine seems to be more optimistic:
Unfortunately, on my Windows machine, results still show decreasing performance (excluding JDK 15).
Even after copying your benchmark verbatim from GitHub and running with the same parameters, I still cannot reproduce the results. In my environment, JDK 14 performs as fast as JDK 8 (even a little bit faster). So, in this answer I'll analyze the difference between both versions basing on the disassembly of the compiled code.
First, let's take the most recent OpenJDK builds from the same vendor.
Here I compare Liberica JDK 8u265+1 and Liberica JDK 14.0.2+13 for Windows 64 bit.
JMH scores are the following:
Benchmark (ELEMENTS) Mode Cnt Score Error Units
IteratingBenchmark.cStyleForLoop 10000000 thrpt 30 263,137 ± 0,484 ops/s # JDK 8
IteratingBenchmark.cStyleForLoop 10000000 thrpt 30 264,406 ± 0,788 ops/s # JDK 14
Now let's run JMH with built-in -prof xperfasm
profiler to see the disassembly of the hottest part of the benchmark. Expectedly, ~99.5% CPU time is spent in C2-compiled cStyleForLoop
method.
Hottest region on JDK 8
....[Hottest Region 1]..............................................................................
C2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 574 (71 bytes)
0x0000028c5607fc5f: add r10d,0fffffff9h
0x0000028c5607fc63: lea rax,[r12+rcx*8]
0x0000028c5607fc67: mov ebx,80000000h
0x0000028c5607fc6c: cmp r9d,r10d
0x0000028c5607fc6f: cmovl r10d,ebx
0x0000028c5607fc73: mov r9d,1h
0x0000028c5607fc79: cmp r10d,1h
╭ 0x0000028c5607fc7d: jle 28c5607fccch
│ 0x0000028c5607fc7f: nop ;*lload_2
│ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0,07% │↗ 0x0000028c5607fc80: movsxd rbx,dword ptr [rax+r9*4+10h]
0,06% ││ 0x0000028c5607fc85: add rbx,r8
8,93% ││ 0x0000028c5607fc88: movsxd rcx,r9d
0,41% ││ 0x0000028c5607fc8b: movsxd r8,dword ptr [rax+rcx*4+2ch]
25,02% ││ 0x0000028c5607fc90: movsxd rdi,dword ptr [rax+rcx*4+14h]
0,10% ││ 0x0000028c5607fc95: movsxd rsi,dword ptr [rax+rcx*4+18h]
8,56% ││ 0x0000028c5607fc9a: movsxd rbp,dword ptr [rax+rcx*4+28h]
0,58% ││ 0x0000028c5607fc9f: movsxd r13,dword ptr [rax+rcx*4+1ch]
0,41% ││ 0x0000028c5607fca4: movsxd r14,dword ptr [rax+rcx*4+20h]
0,20% ││ 0x0000028c5607fca9: movsxd rcx,dword ptr [rax+rcx*4+24h]
8,85% ││ 0x0000028c5607fcae: add rdi,rbx
0,38% ││ 0x0000028c5607fcb1: add rsi,rdi
0,15% ││ 0x0000028c5607fcb4: add r13,rsi
8,57% ││ 0x0000028c5607fcb7: add r14,r13
13,76% ││ 0x0000028c5607fcba: add rcx,r14
5,51% ││ 0x0000028c5607fcbd: add rbp,rcx
8,50% ││ 0x0000028c5607fcc0: add r8,rbp ;*ladd
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
8,95% ││ 0x0000028c5607fcc3: add r9d,8h ;*iinc
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
0,40% ││ 0x0000028c5607fcc7: cmp r9d,r10d
│╰ 0x0000028c5607fcca: jl 28c5607fc80h ;*if_icmpge
│ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
↘ 0x0000028c5607fccc: cmp r9d,edx
0x0000028c5607fccf: jnl 28c5607fce4h
0x0000028c5607fcd1: nop ;*lload_2
; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0x0000028c5607fcd4: movsxd r10,dword ptr [rax+r9*4+10h]
0x0000028c5607fcd9: add r8,r10 ;*ladd
; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
....................................................................................................
Hottest region on JDK 14
....[Hottest Region 1]..............................................................................
c2, level 4, codes.dbg.IteratingBenchmark::cStyleForLoop, version 622 (147 bytes)
; - codes.dbg.IteratingBenchmark::cStyleForLoop@23 (line 25)
0x000001e844438f72: mov r11d,r10d
0x000001e844438f75: add r11d,0fffffff9h
0x000001e844438f79: lea rax,[r12+r9*8]
0x000001e844438f7d: mov ebx,1h
0x000001e844438f82: cmp r11d,1h
0x000001e844438f86: jle 1e8444390c0h ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
╭ 0x000001e844438f8c: jmp 1e844438ffah
│ 0x000001e844438f8e: nop
0,04% │↗ 0x000001e844438f90: mov rsi,r8 ;*lload_2 {reexecute=0 rethrow=0 return_oop=0}
││ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0,04% ││ ↗ 0x000001e844438f93: movsxd rdx,dword ptr [rax+rbx*4+10h]
8,41% ││ │ 0x000001e844438f98: movsxd rbp,dword ptr [rax+rbx*4+14h]
1,23% ││ │ 0x000001e844438f9d: movsxd r13,dword ptr [rax+rbx*4+18h]
0,03% ││ │ 0x000001e844438fa2: movsxd r8,dword ptr [rax+rbx*4+2ch]
23,87% ││ │ 0x000001e844438fa7: movsxd r11,dword ptr [rax+rbx*4+28h]
8,22% ││ │ 0x000001e844438fac: movsxd r9,dword ptr [rax+rbx*4+24h]
1,25% ││ │ 0x000001e844438fb1: movsxd rcx,dword ptr [rax+rbx*4+20h]
0,14% ││ │ 0x000001e844438fb6: movsxd r14,dword ptr [rax+rbx*4+1ch]
0,28% ││ │ 0x000001e844438fbb: add rdx,rsi
7,82% ││ │ 0x000001e844438fbe: add rbp,rdx
1,14% ││ │ 0x000001e844438fc1: add r13,rbp
0,17% ││ │ 0x000001e844438fc4: add r14,r13
14,57% ││ │ 0x000001e844438fc7: add rcx,r14
11,05% ││ │ 0x000001e844438fca: add r9,rcx
5,26% ││ │ 0x000001e844438fcd: add r11,r9
6,32% ││ │ 0x000001e844438fd0: add r8,r11 ;*ladd {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
8,45% ││ │ 0x000001e844438fd3: add ebx,8h ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
1,15% ││ │ 0x000001e844438fd6: cmp ebx,edi
│╰ │ 0x000001e844438fd8: jl 1e844438f90h ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@12 (line 24)
│ │ 0x000001e844438fda: mov r11,qword ptr [r15+110h] ; ImmutableOopMap {rax=Oop xmm0=Oop xmm1=Oop }
│ │ ;*goto {reexecute=1 rethrow=0 return_oop=0}
│ │ ; - (reexecute) codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
0,00% │ │ 0x000001e844438fe1: test dword ptr [r11],eax ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - codes.dbg.IteratingBenchmark::cStyleForLoop@29 (line 24)
│ │ ; {poll}
0,02% │ │ 0x000001e844438fe4: cmp ebx,dword ptr [rsp]
│ ╭│ 0x000001e844438fe7: jnl 1e844439028h
0,00% │ ││ 0x000001e844438fe9: mov rsi,r8
│ ││ 0x000001e844438fec: vmovq r8,xmm0
│ ││ 0x000001e844438ff1: vmovq rdx,xmm1
0,01% │ ││ 0x000001e844438ff6: mov r11d,dword ptr [rsp]
↘ ││ 0x000001e844438ffa: mov ecx,r10d
││ 0x000001e844438ffd: sub ecx,ebx
││ 0x000001e844438fff: add ecx,0fffffff9h
0,00% ││ 0x000001e844439002: mov r9d,1f40h
││ 0x000001e844439008: cmp r9d,ecx
││ 0x000001e84443900b: mov edi,1f40h
││ 0x000001e844439010: cmovnle edi,ecx
0,02% ││ 0x000001e844439013: add edi,ebx
││ 0x000001e844439015: vmovq xmm0,r8
││ 0x000001e84443901a: vmovq xmm1,rdx
││ 0x000001e84443901f: mov dword ptr [rsp],r11d
0,01% │╰ 0x000001e844439023: jmp 1e844438f93h
↘ 0x000001e844439028: vmovq rdx,xmm1
0x000001e84443902d: cmp ebx,r10d
0x000001e844439030: jnl 1e844439043h
0x000001e844439032: nop ;*lload_2 {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@15 (line 25)
0x000001e844439034: movsxd r11,dword ptr [rax+rbx*4+10h]
0x000001e844439039: add r8,r11 ;*ladd {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@24 (line 25)
0x000001e84443903c: inc ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - codes.dbg.IteratingBenchmark::cStyleForLoop@26 (line 24)
....................................................................................................
As we can see, the loop body is compiled similarly on both JDKs:
add
instructions;The key difference is that on JDK 14 the loop iteration is split into two nested blocks. This is a result of Loop strip mining optimization appeared in JDK 10. The idea of this optimization is to split the counted loop into the hot inner part without a safepoint poll, and an outer part with a safepoint poll instruction.
C2 JIT transforms the loop into something like
for (int i = 0; i < array.length; i += 8000) {
for (int j = 0; j < 8000; j += 8) {
int ix = i + j;
int v0 = array[ix];
int v1 = array[ix + 1];
...
int v7 = array[ix + 7];
sum += v0 + v1 + ... + v7;
}
safepoint_poll();
}
Note that JDK 8 version does not have a safepoint poll inside the counted loop at all. On one hand, this can make the loop run faster. But on the other hand, this is actually bad for low latency applications, since the pause time may increase by the duration of the entire loop.
JDK 14 inserts a safepoint poll inside the loop. This might be a reason of a slow down you observe, but I don't really belive in this, since due to loop strip mining optimization, the safepoint polling is performed just once in 8000 iterations.
To verify this, you may disable safepoint polling with -XX:-UseCountedLoopSafepoints
JVM option. In this case, JDK 14 compiled version will look almost identical to JDK 8 one. And so will the benchmark scores.