Why does Java HotSpot can not optimize array instance after one-time resizing (leads to massive performance loss)?

Question

Why is the use of fBuffer1 in the attached code example (SELECT_QUICK = true) double as fast as the other variant when fBuffer2 is resized only once at the beginning (SELECT_QUICK = false)?

The code path is absolutely identical but even after 10 minutes the throughput of fBuffer2 does not increase to this level of fBuffer1.

Background:

We have a generic data processing framework that collects thousands of Java primitive values in different subclasses (one subclass for each primitive type). These values are stored internally in arrays, which we originally sized sufficiently large. To save heap memory, we have now switched these arrays to dynamic resizing (arrays grow only if needed). As expected, this change has massively reduce the heap memory. However, on the other hand the performance has unfortunately degenerated significantly. Our processing jobs now take 2-3 times longer as before (e.g. 6 min instead of 2 min as before).

I have reduced our problem to a minimum working example and attached it. You can choose with SELECT_QUICK which buffer should be used. I see the same effect with jdk-1.8.0_202-x64 as well as with openjdk-17.0.1-x64.

Buffer 1 (is not resized) shows the following numbers:

duration buf1: 8,890.551ms (8.9s)
duration buf1: 8,339.755ms (8.3s)
duration buf1: 8,620.633ms (8.6s)
duration buf1: 8,682.809ms (8.7s)
...

Buffer 2 (is resized exactly 1 time at the beginning) shows the following numbers:

make buffer 2 larger
duration buf2 (resized): 19,542.750ms (19.5s)
duration buf2 (resized): 22,423.529ms (22.4s)
duration buf2 (resized): 22,413.364ms (22.4s)
duration buf2 (resized): 22,219.383ms (22.2s)
...

I would really appreciate to get some hints, how I can change the code so that fBuffer2 (after resizing) works as fast as fBuffer1. The other way round (make fBuffer1 as slow as fBuffer2) is pretty easy. ;-) Since this problem is placed in some framework-like component I would prefer to change the code instead of tuning the hotspot (with external arguments). But of course, suggestions in both directions are very welcome.

Source Code

import java.util.Locale;

public final class Collector {

    private static final boolean SELECT_QUICK = true;

    private static final long LOOP_COUNT = 50_000L;
    private static final int VALUE_COUNT = 150_000;
    private static final int BUFFER_LENGTH = 100_000;

    private final Buffer fBuffer = new Buffer();

    public void reset() {fBuffer.reset();}
    public void addValueBuf1(long val) {fBuffer.add1(val);}
    public void addValueBuf2(long val) {fBuffer.add2(val);}

    public static final class Buffer {

        private int fIdx = 0;
        private long[] fBuffer1 = new long[BUFFER_LENGTH * 2];
        private long[] fBuffer2 = new long[BUFFER_LENGTH];

        public void reset() {fIdx = 0;}

        public void add1(long value) {
            ensureRemainingBuf1(1);
            fBuffer1[fIdx++] = value;
        }

        public void add2(long value) {
            ensureRemainingBuf2(1);
            fBuffer2[fIdx++] = value;
        }

        private void ensureRemainingBuf1(int remaining) {
            if (remaining > fBuffer1.length - fIdx) {
                System.out.println("make buffer 1 larger");
                fBuffer1 = new long[(fIdx + remaining) << 1];
            }
        }

        private void ensureRemainingBuf2(int remaining) {
            if (remaining > fBuffer2.length - fIdx) {
                System.out.println("make buffer 2 larger");
                fBuffer2 = new long[(fIdx + remaining) << 1];
            }
        }

    }

    public static void main(String[] args) {
        Locale.setDefault(Locale.ENGLISH);
        final Collector collector = new Collector();
        if (SELECT_QUICK) {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf1(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf1: %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        } else {
            while (true) {
                final long start = System.nanoTime();
                for (long j = 0L; j < LOOP_COUNT; j++) {
                    collector.reset();
                    for (int k = 0; k < VALUE_COUNT; k++) {
                        collector.addValueBuf2(k);
                    }
                }
                final long nanos = System.nanoTime() - start;
                System.out.printf("duration buf2 (resized): %1$,.3fms (%2$,.1fs)%n",
                    nanos / 1_000_000d, nanos / 1_000_000_000d);
            }
        }
    }

}

Solution

JIT compilation in HotSpot JVM is 1) based on runtime profile data; 2) uses speculative optimizations.

Once the method is compiled at the maximum optimization level, HotSpot stops profiling this code, so it is never recompiled afterwards, no matter how long the code runs. (The exception is when the method needs to be deoptimized or unloaded, but it's not your case).

In the first case (SELECT_QUICK == true), the condition remaining > fBuffer1.length - fIdx is never met, and HotSpot JVM is aware of that from profiling data collected at lower tiers. So it speculatively hoists the check out of the loop, and compiles the loop body with the assumption that array index is always within bounds. After the optimization, the loop is compiled like this (in pseudocode):

if (VALUE_COUNT > collector.fBuffer.fBuffer1.length) {
    uncommon_trap();
}
for (int k = 0; k < VALUE_COUNT; k++) {
    collector.fBuffer.fBuffer1[k] = k;  // no bounds check
}

In the second case (SELECT_QUICK == false), on the contrary, HotSpot knows that condition remaining > fBuffer2.length - fIdx is sometimes met, so it cannot eliminate the check.

Since fIdx is not the loop counter, HotSpot is apparently not smart enough to split the loop into two parts (with and without bounds check). However, you can help JIT compiler by splitting the loop manually:

for (long j = 0L; j < LOOP_COUNT; j++) {
    collector.reset();

    int fastCount = Math.min(collector.fBuffer.fBuffer2.length, VALUE_COUNT);
    for (int k = 0; k < fastCount; k++) {
        collector.addValueBuf2Fast(k);
    }

    for (int k = fastCount; k < VALUE_COUNT; k++) {
        collector.addValueBuf2(k);
    }
}

where addValueBuf2Fast inserts a value without bounds check:

    public void addValueBuf2Fast(long val) {fBuffer.add2Fast(val);}

    public static final class Buffer {
        ...
        public void add2Fast(long value) {
            fBuffer2[fIdx++] = value;
        }
    }

This should dramatically improve performance of the loop:

make buffer 2 larger
duration buf2 (resized): 5,537.681ms (5.5s)
duration buf2 (resized): 5,461.519ms (5.5s)
duration buf2 (resized): 5,450.445ms (5.5s)