
Looping over an array: performance difference between indexed and enhanced for loop


The JLS states that, for arrays, "The enhanced for statement is equivalent to a basic for statement of the form". However, when I check the generated bytecode on JDK 8, the two variants produce different bytecode, and when I try to measure the performance, the enhanced loop surprisingly gives better results (on JDK 8). Can someone explain why that is? My guess is incorrect JMH usage, so if that's the case, please suggest how to fix it. (I know JMH advises against benchmarking loops, but I don't think that applies here, since the loops themselves are what I'm trying to measure.)
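For reference, the JLS desugaring (§14.14.2) evaluates the array expression once into a compiler-generated local before the loop. The names below are illustrative, but the shape matches the spec:

```java
public class Desugar {
    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5};

        int enhancedSum = 0;
        for (int value : values) {
            enhancedSum += value;
        }

        // JLS 14.14.2 equivalent: the array expression is evaluated once
        // into a synthetic local (here 'a'), and the loop indexes into it.
        int desugaredSum = 0;
        int[] a = values;
        for (int i = 0; i < a.length; i++) {
            int value = a[i];
            desugaredSum += value;
        }

        System.out.println(enhancedSum == desugaredSum); // prints: true
    }
}
```

Note that the spec-level equivalence caches the array in a local; it says nothing about the two forms compiling to identical bytecode or machine code.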

My JMH testing was rather simple (probably too simple), but I cannot explain the results. The JMH code is below; typical results are:

JdkBenchmarks.enhanced  avgt    5  2556.281 ±  31.789  ns/op
JdkBenchmarks.indexed   avgt    5  4032.164 ± 100.121  ns/op

meaning the enhanced for loop is typically faster, and its measurement is more precise than the indexed loop's, so the difference cannot be attributed to measurement uncertainty. Essentially the same results hold for arrays initialized with random integers, and for bigger arrays.

import static java.util.concurrent.TimeUnit.NANOSECONDS;
import static org.openjdk.jmh.annotations.Mode.AverageTime;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class JdkBenchmarks {

    @Benchmark
    @BenchmarkMode(AverageTime)
    @OutputTimeUnit(NANOSECONDS)
    public void indexed(Blackhole blackhole, TestState testState) {
        int length = testState.values.length;
        for(int i = 0; i < length; i++) {
            blackhole.consume(testState.values[i]);
        }
    }

    @Benchmark
    @BenchmarkMode(AverageTime)
    @OutputTimeUnit(NANOSECONDS)
    public void enhanced(Blackhole blackhole, TestState testState) {
        for (int value : testState.values) {
            blackhole.consume(value);
        }
    }


    @State(Scope.Benchmark)
    public static class TestState {
        public int[] values;

        @Setup
        public void setupArray() {
            int count = 1000;
            values = new int[count];
            for(int i = 0; i < count; i++) {
                values[i] = i;
            }
        }
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JdkBenchmarks.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();
    }

}

Solution

  • TL;DR: You are observing what happens when the JIT compiler cannot trust that values does not change inside the loop. Additionally, in a tiny benchmark like this, the Blackhole.consume costs dominate, obscuring the results.

    Simplifying the test:

    @Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @Fork(3)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class JdkBenchmarks {
    
        public int[] values;
    
        @Setup
        public void setupArray() {
            int count = 1000;
            values = new int[count];
            for(int i = 0; i < count; i++) {
                values[i] = i;
            }
        }
    
        @Benchmark
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        public void indexed(Blackhole bh) {
            for(int i = 0; i < values.length; i++) {
                bh.consume(values[i]);
            }
        }
    
        @Benchmark
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        public void indexed_cached(Blackhole bh) {
            int[] vs = values;
            int length = vs.length;
            for(int i = 0; i < length; i++) {
                bh.consume(vs[i]);
            }
        }
    
        @Benchmark
        @CompilerControl(CompilerControl.Mode.DONT_INLINE)
        public void enhanced(Blackhole bh) {
            for (int value : values) {
                bh.consume(value);
            }
        }
    }
    

    Running both enhanced and indexed_cached under -prof perfasm reveals this hot loop (I specifically used @CompilerControl(DONT_INLINE) so that each @Benchmark method is compiled on its own, which makes the perfasm output easier to digest):

             ↗  0x...4240: mov  0x10(%r8,%rsi,4),%r10d  ; load values[i], blackhole it
     22.68%  │  0x...4245: mov  0x14(%r8,%rsi,4),%r11d  ; ... repeat 7 more times...
             │  0x...424a: mov  0x18(%r8,%rsi,4),%r10d  ;
     20.95%  │  0x...424f: mov  0x1c(%r8,%rsi,4),%r10d  ;
      0.02%  │  0x...4254: mov  0x20(%r8,%rsi,4),%r11d  ;
     24.73%  │  0x...4259: mov  0x24(%r8,%rsi,4),%r10d  ;
      0.24%  │  0x...425e: mov  0x28(%r8,%rsi,4),%r11d  ;
     20.04%  │  0x...4263: mov  0x2c(%r8,%rsi,4),%r10d  ; 
      0.22%  │  0x...4268: add  $0x8,%esi               ; i += 8
             │  0x...426b: cmp  %ebp,%esi               ; compare i with length (in %ebp)
      0.26%  ╰  0x...426d: jl   0x...4240               ; circle back if 8 more elements available
    

    Very efficient!
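    To see what that 8-way unrolling corresponds to at the source level, here is a rough Java sketch (my own illustration, not decompiled JIT output): a main loop striding by 8, plus a tail loop for the leftovers.

```java
public class UnrollSketch {
    public static void main(String[] args) {
        int[] vs = new int[1000];
        for (int i = 0; i < vs.length; i++) {
            vs[i] = i;
        }

        long sum = 0;
        int length = vs.length;
        int i = 0;
        // Main loop: processes 8 elements per iteration,
        // mirroring the eight back-to-back loads in the hot loop above
        for (; i <= length - 8; i += 8) {
            sum += vs[i]     + vs[i + 1] + vs[i + 2] + vs[i + 3]
                 + vs[i + 4] + vs[i + 5] + vs[i + 6] + vs[i + 7];
        }
        // Tail loop: handles the remaining 0..7 elements
        for (; i < length; i++) {
            sum += vs[i];
        }
        System.out.println(sum); // 0 + 1 + ... + 999 = 499500
    }
}
```

    The unrolling is only legal here because the compiler has proven the array reference and length stable for the duration of the loop.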

    Running indexed with -prof perfasm reveals:

             ↗  0x...4170: mov  0xc(%r12,%r8,8),%r9d    ; array bounds check, load values.length
      3.42%  │  0x...4175: cmp  %r9d,%r10d              ; array bounds check, compare i
     16.02%  │  0x...4178: jae  0x...41b1               ;  failed? jump to exception handling
             │  0x...417a: lea  (%r12,%r8,8),%r11       ; load values[i], part 1
      0.04%  │  0x...417e: mov  0x10(%r11,%r10,4),%r11d ; load values[i], part 2
             │                                          ; %r11d is blackholed
     35.69%  │  0x...4183: mov  0xc(%rsi),%r8d          ; get "values"
      0.71%  │  0x...4187: mov  0x348(%r15),%r11        ; safepoint poll, part 1 (JVM infra)
      4.03%  │  0x...418e: inc  %r10d                   ; i++
      0.12%  │  0x...4191: test %eax,(%r11)             ; safepoint poll, part 2 (JVM infra)
     27.74%  │  0x...4194: mov  0xc(%r12,%r8,8),%r9d    ; load values.length
      8.53%  │  0x...4199: cmp  %r9d,%r10d              ; check i < values.length
      0.24%  ╰  0x...419c: jl   0x...4170               ; circle back if more 
    

    This is because the Blackhole.consume call is opaque to the compiler (like many other non-inlined calls), so it has to conservatively assume that values can change in the middle of the loop!

    This means the compiler cannot stash values in a register, cannot trust the array bounds check to always succeed, and cannot even guarantee the loop terminates (hence the safepoint polls); on top of that, loop unrolling does not want to multiply that per-element mess even further.
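    The re-reads are not compiler paranoia; they are semantically required. Here is a toy demonstration (not JMH; the consume method is a hypothetical stand-in for any opaque call) of a callee legally reassigning the field mid-loop, which the indexed loop must observe:

```java
public class OpaqueReassign {
    static int[] values = new int[1000];
    static int consumed = 0;

    // Stand-in for an opaque, non-inlined call: as far as the JIT knows,
    // Blackhole.consume could do something like this, so the field and
    // its length must be re-read on every iteration.
    static void consume(int v) {
        if (++consumed == 10) {
            values = new int[3]; // reassign the field mid-loop
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < values.length; i++) { // length re-read each pass
            consume(values[i]);
        }
        System.out.println(consumed); // loop stops after 10 calls, not 1000
    }
}
```

    Copying the field into a local first (as indexed_cached does) changes the semantics: the loop would then keep iterating over the old array, and that is exactly the promise that lets the compiler hoist the loads and bounds checks.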

    So you get the penalty like this (TR 3970X, JDK 17.0.2 EA, Linux x86_64):

    Benchmark                     Mode  Cnt     Score   Error  Units
    JdkBenchmarks.enhanced        avgt    5   144.962 ± 0.918  ns/op
    JdkBenchmarks.indexed         avgt    5  1030.981 ± 3.775  ns/op ; + 880 ns/op!
    JdkBenchmarks.indexed_cached  avgt    5   144.799 ± 0.643  ns/op ; same as enhanced
    

    Additional fun part:

    On most JDKs, the dominating cost in this test is the cost of calling Blackhole.consume itself. Compared to the cost of an array access, the cost of the Java-style Blackhole is quite bad. With JDK 17+ and JMH 1.34+, compiler blackholes are used (they can also be requested explicitly with -Djmh.blackhole.mode=COMPILER), which provides much better fidelity for the test.

    Without compiler blackholes, the effect hides almost completely in the Blackhole overhead (a >25x overhead means we can execute a lot of bad code before the Blackhole call without it showing!):

    Benchmark                     Mode  Cnt     Score   Error  Units
    JdkBenchmarks.enhanced        avgt    5  4062.866 ± 4.736  ns/op
    JdkBenchmarks.indexed         avgt    5  4071.620 ± 1.057  ns/op ; + 10 ns/op [whoops]
    JdkBenchmarks.indexed_cached  avgt    5  4061.390 ± 0.692  ns/op ; same as enhanced
    

    The effect re-manifests if we drop @CompilerControl(DONT_INLINE), because the resulting generated code is much messier:

    Benchmark                     Mode  Cnt     Score    Error  Units
    JdkBenchmarks.enhanced        avgt    5  4067.118 ± 40.699  ns/op
    JdkBenchmarks.indexed         avgt    5  4601.370 ±  0.632  ns/op ; + 530 ns/op
    JdkBenchmarks.indexed_cached  avgt    5  4064.455 ±  1.554  ns/op ; same as enhanced