java performance optimization conditional-operator

Why is a ternary operator with two constants faster than one with a variable?

In Java, I have two different statements which accomplish the same result through using ternary operators, which are as follows:

num < 0 ? 0 : num;
num * (num < 0 ? 0 : 1);

It appears that the second statement is unnecessarily complex and would take longer than the first, however when I recorded the time that each took, using the following code, the results were as follows:

final long startTime = System.currentTimeMillis();

Random rand = new Random();
float[] results = new float[100000000];
for (int i = 0; i < 100000000; i++) {
    float num = (rand.nextFloat() * 2) - 1;
    results[i] = num < 0 ? 0 : num;
    //results[i] = num * (num < 0 ? 0 : 1);
}

final long endTime = System.currentTimeMillis();

System.out.println("Total Time: " + (endTime - startTime));

1.232 seconds
1.023 seconds (Each averaged over 5 runs)

Why is there this significant speedup when using the second statement? It seems to include an unnecessary multiplication and have the same comparison. Does the first create a branch whilst the second does not?

Solution

First, let's rewrite the benchmark with JMH to avoid common benchmarking pitfalls.

public class FloatCompare {

    @Benchmark
    public float cmp() {
        float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
        return num < 0 ? 0 : num;
    }

    @Benchmark
    public float mul() {
        float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
        return num * (num < 0 ? 0 : 1);
    }
}

JMH also suggests that the multiplication code is a way faster:

Benchmark         Mode  Cnt   Score   Error  Units
FloatCompare.cmp  avgt    5  12,940 ± 0,166  ns/op
FloatCompare.mul  avgt    5   6,182 ± 0,101  ns/op

Now it's time to engage perfasm profiler (built into JMH) to see the assembly produced by JIT compiler. Here are the most important parts of the output (comments are mine):

cmp method:

  5,65%  │││  0x0000000002e717d0: vxorps  xmm1,xmm1,xmm1  ; xmm1 := 0
  0,28%  │││  0x0000000002e717d4: vucomiss xmm1,xmm0      ; compare num < 0 ?
  4,25%  │╰│  0x0000000002e717d8: jbe     2e71720h        ; jump if num >= 0
  9,77%  │ ╰  0x0000000002e717de: jmp     2e71711h        ; jump if num < 0

mul method:

  1,59%  ││  0x000000000321f90c: vxorps  xmm1,xmm1,xmm1    ; xmm1 := 0
  3,80%  ││  0x000000000321f910: mov     r11d,1h           ; r11d := 1
         ││  0x000000000321f916: xor     r8d,r8d           ; r8d := 0
         ││  0x000000000321f919: vucomiss xmm1,xmm0        ; compare num < 0 ?
  2,23%  ││  0x000000000321f91d: cmovnbe r11d,r8d          ; r11d := r8d if num < 0
  5,06%  ││  0x000000000321f921: vcvtsi2ss xmm1,xmm1,r11d  ; xmm1 := (float) r11d
  7,04%  ││  0x000000000321f926: vmulss  xmm0,xmm1,xmm0    ; multiply

The key difference is that there's no jump instructions in the mul method. Instead, conditional move instruction cmovnbe is used.

cmov works with integer registers. Since (num < 0 ? 0 : 1) expression uses integer constants on the right side, JIT is smart enough to emit a conditional move instead of a conditional jump.

In this benchmark, conditional jump is very inefficient, since branch prediction often fails due to random nature of numbers. That's why the branchless code of mul method appears faster.

If we modify the benchmark in a way that one branch prevails over another, e.g by replacing

ThreadLocalRandom.current().nextFloat() * 2 - 1

with

ThreadLocalRandom.current().nextFloat() * 2 - 0.1f

then the branch prediction will work better, and cmp method will become as fast as mul:

Benchmark         Mode  Cnt  Score   Error  Units
FloatCompare.cmp  avgt    5  5,793 ± 0,045  ns/op
FloatCompare.mul  avgt    5  5,764 ± 0,048  ns/op