Consider the following two snippets of code on an array of length 2:
boolean isOK(int i) {
for (int j = 0; j < filters.length; ++j) {
if (!filters[j].isOK(i)) {
return false;
}
}
return true;
}
and
boolean isOK(int i) {
return filters[0].isOK(i) && filters[1].isOK(i);
}
I would assume that the performance of these two pieces should be similar after sufficient warm-up.
I've checked this using JMH micro-benchmarking framework as described e.g. here and here and observed that the second snippet is more than 10% faster.
Question: why hasn't Java optimized my first snippet using the basic loop unrolling technique?
In particular, I'd like to understand the following:
return (filters.length) == 2 ? new FilterChain2(filters) : new FilterChain1(filters)
. Can JITC do the same and if not, why?Ideally, I would like to receive an answer from someone with a deep understanding of how JITC works.
Benchmark run details:
Typical benchmark output:
Benchmark (filterIndex) Mode Cnt Score Error Units
LoopUnrollingBenchmark.runBenchmark 0 avgt 400 44.202 ± 0.224 ns/op
LoopUnrollingBenchmark.runBenchmark 1 avgt 400 38.347 ± 0.063 ns/op
(The first line corresponds to the first snippet, the second line - to the second.
Complete benchmark code:
public class LoopUnrollingBenchmark {
@State(Scope.Benchmark)
public static class BenchmarkData {
public Filter[] filters;
@Param({"0", "1"})
public int filterIndex;
public int num;
@Setup(Level.Invocation) //similar ratio with Level.TRIAL
public void setUp() {
filters = new Filter[]{new FilterChain1(), new FilterChain2()};
num = new Random().nextInt();
}
}
@Benchmark
@Fork(warmups = 5, value = 20)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public int runBenchmark(BenchmarkData data) {
Filter filter = data.filters[data.filterIndex];
int sum = 0;
int num = data.num;
if (filter.isOK(num)) {
++sum;
}
if (filter.isOK(num + 1)) {
++sum;
}
if (filter.isOK(num - 1)) {
++sum;
}
if (filter.isOK(num * 2)) {
++sum;
}
if (filter.isOK(num * 3)) {
++sum;
}
if (filter.isOK(num * 5)) {
++sum;
}
return sum;
}
interface Filter {
boolean isOK(int i);
}
static class Filter1 implements Filter {
@Override
public boolean isOK(int i) {
return i % 3 == 1;
}
}
static class Filter2 implements Filter {
@Override
public boolean isOK(int i) {
return i % 7 == 3;
}
}
static class FilterChain1 implements Filter {
final Filter[] filters = createLeafFilters();
@Override
public boolean isOK(int i) {
for (int j = 0; j < filters.length; ++j) {
if (!filters[j].isOK(i)) {
return false;
}
}
return true;
}
}
static class FilterChain2 implements Filter {
final Filter[] filters = createLeafFilters();
@Override
public boolean isOK(int i) {
return filters[0].isOK(i) && filters[1].isOK(i);
}
}
private static Filter[] createLeafFilters() {
Filter[] filters = new Filter[2];
filters[0] = new Filter1();
filters[1] = new Filter2();
return filters;
}
public static void main(String[] args) throws Exception {
org.openjdk.jmh.Main.main(args);
}
}
TL;DR The main reason of performance difference here is not related to loop unrolling. It is rather the type speculation and the inline caches.
In fact, in HotSpot terminology, such loops are treated as counted, and in certain cases JVM can unroll them. Not in your case though.
HotSpot has two loop unrolling strategies: 1) unroll maximally, i.e. remove the loop altogether; or 2) glue several consecutive iterations together.
Maximal unrolling can be done, only if the exact number of iterations is known.
if (!cl->has_exact_trip_count()) {
// Trip count is not exact.
return false;
}
In your case, however, the function may return early after the first iteration.
Partial unrolling could be probably applied, but the following condition breaks unrolling:
// Don't unroll if the next round of unrolling would push us
// over the expected trip count of the loop. One is subtracted
// from the expected trip count because the pre-loop normally
// executes 1 iteration.
if (UnrollLimitForProfileCheck > 0 &&
cl->profile_trip_cnt() != COUNT_UNKNOWN &&
future_unroll_ct > UnrollLimitForProfileCheck &&
(float)future_unroll_ct > cl->profile_trip_cnt() - 1.0) {
return false;
}
Since in your case the expected trip count is less than 2, HotSpot assumes it's not worthy to unroll even two iterations. Note that the first iteration is extracted into pre-loop anyway (loop peeling optimization), so unrolling is indeed not very benificial here.
In your unrolled version, there are two different invokeinterface
bytecodes. These sites have two distinct type profiles. The first receiver is always Filter1
, and the second receiver is always Filter2
. So, you basically have two monomorphic call sites, and HotSpot can perfectly inline both calls - so called "inline cache" which has 100% hit ratio in this case.
With the loop, there is just one invokeinterface
bytecode, and only one type profile is collected. HotSpot JVM sees that filters[j].isOK()
is called 86% times with Filter1
receiver and 14% times with Filter2
receiver. This will be a bimorphic call. Fortunately, HotSpot can speculatively inline bimorphic calls, too. It inlines both targets with a conditional branch. However, in this case the hit ratio will be at most 86%, and the performance will suffer from the corresponding mispredicted branches at the architecture level.
Things will be even worse, if you have 3 or more different filters. In this case isOK()
will be a megamorphic call which HotSpot cannot inline at all. So, the compiled code will contain a true interface call which has a larger performance impact.
More about speculative inlining in the article The Black Magic of (Java) Method Dispatch.
In order to inline virtual/interface calls, HotSpot JVM collects type profiles per invoke bytecode. If there is a virtual call in a loop, there will be just one type profile for the call, no matter if the loop is unrolled or not.
To get the best from the virtual call optimizations, you'd need to manually split the loop, primarily for the purpose of splitting type profiles. HotSpot cannot do this automatically so far.