Tags: java, performance, performance-testing, benchmarking, jmh

What kind of JVM optimization happens for ByteArrayOutputStream?


I have the following JMH benchmark (Java 8):

@Benchmark
public byte[] outputStream() {
    final ByteArrayOutputStream baos = new ByteArrayOutputStream();
    for (int i = 0; i < size; i++) {
        baos.write(i);
    }
    return baos.toByteArray();
}
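
For reference, a minimal harness around this method (class and parameter names assumed, since the original class isn't shown) would look like:

import java.io.ByteArrayOutputStream;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class OutputStreamBench {

    @Param("65")  // assumed: size could equally be a plain field
    int size;

    @Benchmark
    public byte[] outputStream() {
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (int i = 0; i < size; i++) {
            baos.write(i);
        }
        return baos.toByteArray();
    }
}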

When, for example, size == 65, the output is the following:

# Warmup Iteration   1: 3296444.108 ops/s
# Warmup Iteration   2: 2861235.712 ops/s
# Warmup Iteration   3: 4909462.444 ops/s
# Warmup Iteration   4: 4969418.622 ops/s
# Warmup Iteration   5: 5009353.033 ops/s
Iteration   1: 5006466.075 ops/s
...

Obviously, something happened during warmup iteration 2, because there is a massive speedup after it.

How can I figure out what kind of JVM optimization happened at that point?


Solution

  • Let's assume you have a stable result at 5M ops/s. Is that believable? For the sake of argument, let's assume a 3GHz CPU (you are probably on a laptop with frequency scaling and turbo boost on, but anyway): 5M ops/s => 200ns per op => 600 cycles. What did we ask the CPU to do? (A hand-rolled equivalent is sketched after this list.)

    • Allocate ByteArrayOutputStream, default constructor -> new byte[32], + change
    • Simple counted loop, 65 times, write a byte to array
    • Resize the byte array, 2 times. 32 -> 64 -> 128
    • Copy to new array (65) and return
    • Trivial loop accounting for JMH
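
    Written out by hand, that work amounts to roughly the following (a sketch, not ByteArrayOutputStream's actual source; the real class also synchronizes every write):

    byte[] buf = new byte[32];                // default ByteArrayOutputStream capacity
    int count = 0;
    for (int i = 0; i < 65; i++) {
        if (count == buf.length) {            // 32 -> 64 -> 128: two doublings for size == 65
            buf = java.util.Arrays.copyOf(buf, buf.length * 2);
        }
        buf[count++] = (byte) i;
    }
    byte[] result = java.util.Arrays.copyOf(buf, count);  // toByteArray(): copy the 65 bytes out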

    What kind of optimizations could we hope to have happen?

    • Going from the interpreter to native compilation (duh)
    • Loop unrolling and a ton of loop optimizations, all of which probably don't help much
    • Escape analysis for ByteArrayOutputStream and its myriad of buddies. I don't think it happened (a quick way to check is sketched below).
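
    A cheap way to test the escape-analysis guess is to rerun with it switched off and compare; if the score doesn't move, EA wasn't contributing. A sketch, assuming it lives in the same @State class as the benchmark (-XX:-DoEscapeAnalysis is a standard HotSpot flag, @Fork is plain JMH):

    @Fork(jvmArgsAppend = "-XX:-DoEscapeAnalysis")  // disable escape analysis for this fork only
    @Benchmark
    public byte[] outputStreamNoEa() {
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (int i = 0; i < size; i++) {
            baos.write(i);
        }
        return baos.toByteArray();
    }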

    How do I know what happened? I'll run it with some helpful profilers. JMH offers loads of those.

    With -prof gc I can see the allocation rate here: ·gc.alloc.rate.norm: 360.000 B/op. So I guessed 32 + 64 + 128 + 65 + change = 289b + change => change = 71b. That's a lot of change, right? Well, not if you account for object headers. We have 4 arrays and one object => 5 * 12 (compressed-oops headers) = 60b, plus the array length fields and the count field on ByteArrayOutputStream = 20b. So change should be 80b by my calculation, but I'm probably missing something. Bottom line: no escape analysis for us, but some CompressedOops help. You can use an allocation profiler, like the one in JVisualVM, to track down all the different allocations here, or a sampling allocation profiler like the one in Java Mission Control.
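
    If you'd rather attach the profilers from code than from the command line, JMH's Runner API takes them too; the include pattern below is an assumption, while the profiler classes ship with JMH:

    import org.openjdk.jmh.profile.GCProfiler;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;

    public class ProfiledRun {
        public static void main(String[] args) throws Exception {
            Options opt = new OptionsBuilder()
                    .include("outputStream")        // assumed benchmark name pattern
                    .addProfiler(GCProfiler.class)  // equivalent of -prof gc
                    // .addProfiler(org.openjdk.jmh.profile.LinuxPerfAsmProfiler.class)  // -prof perfasm, see below
                    .build();
            new Runner(opt).run();
        }
    }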

    You can look at the assembly output and profile at that level using -prof perfasm. That is a very long exercise, so I'm not going to go through it here. One of the cool optimizations you can see there is that the JVM does not zero out the new array copy it makes at the end of your method. You can also see that, as expected, the allocation and copying of arrays are where the time is spent.

    Finally, the obvious optimization taking place here is just JIT compilation. You can explore what each level of compilation did by using a tool like JITWatch, and you can use command-line flags to find the performance at each compilation level (`-jvmArgs=-Xint` to run in the interpreter only, `-jvmArgs=-XX:TieredStopAtLevel=1` to stop at C1).
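
    As a sketch (same @State class as before; the flags are standard HotSpot, the wrapper method names are mine):

    @Fork(jvmArgsAppend = "-Xint")                    // interpreter only
    @Benchmark
    public byte[] outputStreamInterpreted() {
        return outputStream();
    }

    @Fork(jvmArgsAppend = "-XX:TieredStopAtLevel=1")  // C1 only
    @Benchmark
    public byte[] outputStreamC1() {
        return outputStream();
    }

    // the plain outputStream() benchmark already gives you full tiered compilation (C1 + C2)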

    Another large-scale optimization at play is the expansion of the heap to accommodate the allocation rate. You can experiment with heap sizes to find how this affects the performance.
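
    For example (a sketch; the sizes are arbitrary), pinning the heap so it cannot expand mid-run:

    @Fork(jvmArgsAppend = {"-Xms1g", "-Xmx1g"})  // -Xms == -Xmx: the heap never needs to grow
    @Benchmark
    public byte[] outputStreamFixedHeap() {
        return outputStream();
    }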

    Have fun :-)