In Java, I have two different statements which accomplish the same result through using ternary operators, which are as follows:
num < 0 ? 0 : num;
num * (num < 0 ? 0 : 1);
It appears that the second statement is unnecessarily complex and would take longer than the first, however when I recorded the time that each took, using the following code, the results were as follows:
final long startTime = System.currentTimeMillis();
Random rand = new Random();
float[] results = new float[100000000];
for (int i = 0; i < 100000000; i++) {
float num = (rand.nextFloat() * 2) - 1;
results[i] = num < 0 ? 0 : num;
//results[i] = num * (num < 0 ? 0 : 1);
}
final long endTime = System.currentTimeMillis();
System.out.println("Total Time: " + (endTime - startTime));
Why is there this significant speedup when using the second statement? It seems to include an unnecessary multiplication and have the same comparison. Does the first create a branch whilst the second does not?
First, let's rewrite the benchmark with JMH to avoid common benchmarking pitfalls.
public class FloatCompare {
@Benchmark
public float cmp() {
float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
return num < 0 ? 0 : num;
}
@Benchmark
public float mul() {
float num = ThreadLocalRandom.current().nextFloat() * 2 - 1;
return num * (num < 0 ? 0 : 1);
}
}
JMH also suggests that the multiplication code is a way faster:
Benchmark Mode Cnt Score Error Units
FloatCompare.cmp avgt 5 12,940 ± 0,166 ns/op
FloatCompare.mul avgt 5 6,182 ± 0,101 ns/op
Now it's time to engage perfasm profiler (built into JMH) to see the assembly produced by JIT compiler. Here are the most important parts of the output (comments are mine):
cmp
method:
5,65% │││ 0x0000000002e717d0: vxorps xmm1,xmm1,xmm1 ; xmm1 := 0
0,28% │││ 0x0000000002e717d4: vucomiss xmm1,xmm0 ; compare num < 0 ?
4,25% │╰│ 0x0000000002e717d8: jbe 2e71720h ; jump if num >= 0
9,77% │ ╰ 0x0000000002e717de: jmp 2e71711h ; jump if num < 0
mul
method:
1,59% ││ 0x000000000321f90c: vxorps xmm1,xmm1,xmm1 ; xmm1 := 0
3,80% ││ 0x000000000321f910: mov r11d,1h ; r11d := 1
││ 0x000000000321f916: xor r8d,r8d ; r8d := 0
││ 0x000000000321f919: vucomiss xmm1,xmm0 ; compare num < 0 ?
2,23% ││ 0x000000000321f91d: cmovnbe r11d,r8d ; r11d := r8d if num < 0
5,06% ││ 0x000000000321f921: vcvtsi2ss xmm1,xmm1,r11d ; xmm1 := (float) r11d
7,04% ││ 0x000000000321f926: vmulss xmm0,xmm1,xmm0 ; multiply
The key difference is that there's no jump instructions in the mul
method. Instead, conditional move instruction cmovnbe
is used.
cmov
works with integer registers. Since (num < 0 ? 0 : 1)
expression uses integer constants on the right side, JIT is smart enough to emit a conditional move instead of a conditional jump.
In this benchmark, conditional jump is very inefficient, since branch prediction often fails due to random nature of numbers. That's why the branchless code of mul
method appears faster.
If we modify the benchmark in a way that one branch prevails over another, e.g by replacing
ThreadLocalRandom.current().nextFloat() * 2 - 1
with
ThreadLocalRandom.current().nextFloat() * 2 - 0.1f
then the branch prediction will work better, and cmp
method will become as fast as mul
:
Benchmark Mode Cnt Score Error Units
FloatCompare.cmp avgt 5 5,793 ± 0,045 ns/op
FloatCompare.mul avgt 5 5,764 ± 0,048 ns/op