Tags: java, performance, jvm, microbenchmark

Is the JVM able to do trivial recursive call precomputation?


A while ago, I ran Rust benchmarks of two different software multiplication algorithms: trivial recursive multiplication and Russian peasant multiplication.

To my amazement, the compiler was able to analyze the trivial recursion and replace the method call directly with its result (e.g. a call to mul0(4, 8) is compiled down to the constant 32).

To see if the JVM is able to perform the same optimization, I measured the Java implementation below with JMH. Yet the Russian peasant algorithm is faster, and it seems that the VM is not performing any similar optimization.

Is there a similar optimization technique (replacing a recursive call with its precomputed result) built into the JVM, or is this something the JVM simply does not do for some reason?

I am aware that this is VM-dependent and might change, so I am more interested in the general roadblocks that keep VM implementers from incorporating such an optimization into their VM.

Code snippets:

@Warmup(iterations = 10)
@Fork(value = 2)
@State(Scope.Benchmark)
public class MyBenchmark {

    private int F1 = 542;
    private int F2 = 323;

    public final static int mul0(int a, int b) {
        if (a == 1) {
            return b;
        }
        return mul0(a - 1, b) + b;
    }

    //O(log n)
    public final static int mul2(int a, int b) {
        if (a == 1) {
            return b;
        }

        int sum = ((a & 1) == 1) ? b : 0;

        return mul2(a / 2, b + b) + sum;
    }

    @Benchmark
    public void test0() {
        mul0(F1, F2);
    }

    @Benchmark
    public void test2() {
        mul2(F1, F2);
    }

}

Results:

Result: 13852692,903 ±(99.9%) 532102,125 ops/s [Average]
  Statistics: (min, avg, max) = (9899651,068, 13852692,903, 15356453,576), stdev = 945811,061
  Confidence interval (99.9%): [13320590,778, 14384795,028]


# Run complete. Total time: 00:02:22

Benchmark                   Mode  Samples         Score  Score error  Units
d.s.m.MyBenchmark.test0    thrpt       40   1453817,627    68528,256  ops/s
d.s.m.MyBenchmark.test2    thrpt       40  13852692,903   532102,125  ops/s

Solution

    Short answer

    The HotSpot JVM is capable of such an optimization, but the default JVM options prevent it from doing so.

    Long answer

    First of all, the benchmark needs to be slightly corrected to see the effect.

    1. By design, fields of @State classes are re-read on each iteration. The JVM does not know they are constants, so it cannot constant-fold them. Make F1 and F2 final so that they become constants and further optimization is possible.
    2. Benchmark methods should consume their computation results, either by calling Blackhole.consume or simply by returning a value from the method (a Blackhole-based variant is sketched after the corrected code below).

      private final int F1 = 542;
      private final int F2 = 323;
      
      public final static int mul0(int a, int b) {
          if (a == 1) {
              return b;
          }
          return mul0(a - 1, b) + b;
      }
      
      //O(log n)
      public final static int mul2(int a, int b) {
          if (a == 1) {
              return b;
          }
      
          int sum = ((a & 1) == 1) ? b : 0;
      
          return mul2(a / 2, b + b) + sum;
      }
      
      @Benchmark
      public int test0() {
          return mul0(F1, F2);
      }
      
      @Benchmark
      public int test2() {
          return mul2(F1, F2);
      }
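
    For reference, the Blackhole-based variant from step 2 would look roughly like the sketch below (the method name test0Blackhole is just illustrative; JMH injects the org.openjdk.jmh.infra.Blackhole parameter automatically):

      // requires: import org.openjdk.jmh.infra.Blackhole;
      @Benchmark
      public void test0Blackhole(Blackhole bh) {
          // Handing the result to the Blackhole keeps the JIT from
          // eliminating the computation as dead code, just like returning it.
          bh.consume(mul0(F1, F2));
      }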
      

    Now HotSpot can inline the method calls and perform constant folding. However, by default the inlining of recursive methods is limited to just one level. We can override this with the following options:

    -XX:MaxInlineLevel=20 -XX:MaxRecursiveInlineLevel=20
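
    Alternatively, these flags can be kept with the benchmark itself: JMH's @Fork annotation accepts jvmArgsAppend. A minimal sketch of that approach (the annotation values shown are just the ones used here):

      // requires: import org.openjdk.jmh.annotations.Fork;
      @Warmup(iterations = 10)
      @Fork(value = 2, jvmArgsAppend = {
              "-XX:MaxInlineLevel=20",
              "-XX:MaxRecursiveInlineLevel=20"})
      @State(Scope.Benchmark)
      public class MyBenchmark {
          // ... fields and benchmark methods as above ...
      }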
    

    Now test2 becomes really fast, since it performs fewer than 20 method calls (mul2 halves a on every step, so mul2(542, 323) recurses only about 10 times):

    Benchmark               Mode  Cnt    Score    Error  Units
    MyBenchmark.test0       avgt    5  675,763 ± 16,422  ns/op
    MyBenchmark.test2       avgt    5    5,320 ±  0,274  ns/op
    

    Looking at the assembly code generated with -prof perfasm, we can verify that test2 returns the precomputed value:

    0x00000000038e5960: mov    %r10,0x20(%rsp)
    0x00000000038e5965: mov    0x58(%rsp),%rdx
    0x00000000038e596a: mov    $0x2abda,%r8d        <<<<
    0x00000000038e5970: data32 xchg %ax,%ax
    0x00000000038e5973: callq  0x00000000037061a0  ;*invokevirtual consume
    

    0x2abda = 175066 = 542 * 323 = mul2(F1, F2)
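
    As a usage note, the same experiment (raised inlining limits plus the perfasm profiler) can also be launched programmatically through JMH's Runner API. A minimal sketch, assuming the usual JMH runner classes are on the classpath and using a hypothetical BenchmarkRunner class:

      import org.openjdk.jmh.runner.Runner;
      import org.openjdk.jmh.runner.RunnerException;
      import org.openjdk.jmh.runner.options.Options;
      import org.openjdk.jmh.runner.options.OptionsBuilder;

      public class BenchmarkRunner {
          public static void main(String[] args) throws RunnerException {
              Options opt = new OptionsBuilder()
                      .include(MyBenchmark.class.getSimpleName())
                      // raise the recursive inlining limits in the forked JVMs
                      .jvmArgsAppend("-XX:MaxInlineLevel=20",
                                     "-XX:MaxRecursiveInlineLevel=20")
                      // print generated assembly (requires perf and the hsdis library)
                      .addProfiler("perfasm")
                      .build();
              new Runner(opt).run();
          }
      }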