A while ago, I ran Rust benchmarks of two different software multiplication algorithms: trivial recursive multiplication and Russian peasant multiplication.
To my amazement, the compiler was able to analyze the trivial recursion and replace the method call directly with its result (e.g. a call to mul0(4, 8) was folded to 32).
To see whether the JVM can perform the same optimization, I measured the Java implementation below via JMH. Yet the Russian peasant algorithm is faster, and it seems that the VM does not perform any similar optimization.
Is a similar optimization technique (replacing a recursive call with a precomputed result) built into the JVM, or is this something the JVM does not do per se for some reason?
I am aware that this is VM-dependent and might change, so I am more interested in the general roadblocks that hinder VM implementers from incorporating such an optimization into their VMs.
Code snippets:
@Warmup(iterations = 10)
@Fork(value = 2)
@State(Scope.Benchmark)
public class MyBenchmark {

    private int F1 = 542;
    private int F2 = 323;

    public final static int mul0(int a, int b) {
        if (a == 1) {
            return b;
        }
        return mul0(a - 1, b) + b;
    }

    // O(log n)
    public final static int mul2(int a, int b) {
        if (a == 1) {
            return b;
        }
        int sum = ((a & 1) == 1) ? b : 0;
        return mul2(a / 2, b + b) + sum;
    }

    @Benchmark
    public void test0() {
        mul0(F1, F2);
    }

    @Benchmark
    public void test2() {
        mul2(F1, F2);
    }
}
Results:
Result: 13852692,903 ±(99.9%) 532102,125 ops/s [Average]
Statistics: (min, avg, max) = (9899651,068, 13852692,903, 15356453,576), stdev = 945811,061
Confidence interval (99.9%): [13320590,778, 14384795,028]
# Run complete. Total time: 00:02:22
Benchmark Mode Samples Score Score error Units
d.s.m.MyBenchmark.test0 thrpt 40 1453817,627 68528,256 ops/s
d.s.m.MyBenchmark.test2 thrpt 40 13852692,903 532102,125 ops/s
The HotSpot JVM is capable of such an optimization, but the default JVM options prevent it from doing so.
First of all, the benchmark needs to be slightly corrected to see the effect:
- Fields of @State classes are re-read on each iteration. The JVM does not know they are constants, so it cannot constant-fold them. Make F1 and F2 final to turn them into constants and allow further optimization.
- Benchmark methods should consume their computation results, either by calling Blackhole.consume or simply by returning a value from the method.
The corrected benchmark:
private final int F1 = 542;
private final int F2 = 323;

public final static int mul0(int a, int b) {
    if (a == 1) {
        return b;
    }
    return mul0(a - 1, b) + b;
}

// O(log n)
public final static int mul2(int a, int b) {
    if (a == 1) {
        return b;
    }
    int sum = ((a & 1) == 1) ? b : 0;
    return mul2(a / 2, b + b) + sum;
}

@Benchmark
public int test0() {
    return mul0(F1, F2);
}

@Benchmark
public int test2() {
    return mul2(F1, F2);
}
Now HotSpot can inline method calls and perform constant folding. However, by default the inlining of recursive methods is limited to just one level. We can override this with the following options:
-XX:MaxInlineLevel=20 -XX:MaxRecursiveInlineLevel=20
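In JMH, these flags can also be baked into the benchmark itself rather than passed on the command line, e.g. via the jvmArgsAppend element of the @Fork annotation (a minimal sketch; jvmArgsAppend simply appends the options to the forked JVM's command line):

@Fork(value = 2, jvmArgsAppend = {"-XX:MaxInlineLevel=20", "-XX:MaxRecursiveInlineLevel=20"})
@State(Scope.Benchmark)
public class MyBenchmark {
    // ... fields and benchmark methods as above ...
}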
Now test2 becomes really fast, since it obviously performs fewer than 20 method calls:
Benchmark Mode Cnt Score Error Units
MyBenchmark.test0 avgt 5 675,763 ± 16,422 ns/op
MyBenchmark.test2 avgt 5 5,320 ± 0,274 ns/op
Looking at the generated assembly code using -prof perfasm, we can verify that test2 returns the precomputed value:
0x00000000038e5960: mov %r10,0x20(%rsp)
0x00000000038e5965: mov 0x58(%rsp),%rdx
0x00000000038e596a: mov $0x2abda,%r8d <<<<
0x00000000038e5970: data32 xchg %ax,%ax
0x00000000038e5973: callq 0x00000000037061a0 ;*invokevirtual consume
0x2abda = 175066 = 542 * 323 = mul2(F1, F2)
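For completeness, here is a minimal sketch of driving the same experiment programmatically through the JMH runner API, with both inlining flags and the perfasm profiler enabled (the BenchmarkRunner class name is just for illustration, and it assumes the MyBenchmark class above plus perf-based profiling being available on the machine):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
                .include(MyBenchmark.class.getSimpleName())    // run the benchmarks defined above
                .jvmArgsAppend("-XX:MaxInlineLevel=20",
                               "-XX:MaxRecursiveInlineLevel=20")
                .addProfiler("perfasm")                        // print hot assembly regions
                .build();
        new Runner(opts).run();
    }
}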