Search code examples
javaperformanceperfjmh

Confusing branches and L1-dcache-loads in the output of JMH's LinuxPerfNormProfiler


I'm measuring costs of busy-waiting implemented with two approaches:

// 1
while(run) { // looping on volatile flag
}

// 2
while(run) { // looping on volatile flag
  Thread.onSpinWait();
}

Complete code of the examples is available on GitHub via link1 and link2. I'm running the benchmarks with -prof perfnorm and they yield:

Benchmark                                                        Mode  Cnt      Score      Error      Units
WhileTrueBenchmark.whileTrue                                     avgt   20   6460.700 ±  109.333      ns/op
WhileTrueBenchmark.whileTrue:CPI                                 avgt    4      0.444 ±    0.005  clks/insn
WhileTrueBenchmark.whileTrue:IPC                                 avgt    4      2.252 ±    0.026  insns/clk
WhileTrueBenchmark.whileTrue:L1-dcache-loads                     avgt    4  51523.529 ± 3556.009       #/op
WhileTrueBenchmark.whileTrue:branches                            avgt    4  13981.285 ±  958.249       #/op
WhileTrueBenchmark.whileTrue:cycles                              avgt    4  36407.576 ± 2434.292       #/op
WhileTrueBenchmark.whileTrue:instructions                        avgt    4  81985.523 ± 6300.983       #/op


ThreadOnSpinWaitPlainBenchmark.onSpinWait                        avgt   20   6463.334 ±   49.922      ns/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:CPI                    avgt    4      2.143 ±    0.056  clks/insn
ThreadOnSpinWaitPlainBenchmark.onSpinWait:IPC                    avgt    4      0.467 ±    0.012  insns/clk
ThreadOnSpinWaitPlainBenchmark.onSpinWait:L1-dcache-loads        avgt    4   7262.587 ±  324.600       #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:branches               avgt    4   2951.111 ±  162.867       #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:cycles                 avgt    4  36307.064 ± 1516.787       #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:instructions           avgt    4  16943.396 ±  820.446       #/op

So from the output we see that for the same time elapsed we have the same cycle count, but Thread.onSpinWait() executed almost 5 times less instructions. This is understandable and expected behavior.

What is unexpected to me is that it produced much less branches and L1-dcache-loads. The benchmarked code doesn't have much branching and reads one and the same flag from memory.

So why are these two metrics (branches and L1-dcache-loads) so different?


Solution

  • On x86, Thread.onSpinWait() intrinsic is translated to the PAUSE instruction. PAUSE delays the execution of the next instruction for an implementation-specific amount of time. Because of this delay, the second loop executes the less number of times comparing to the loop without onSpinWait.

    Extra delay per each loop iteration => less number of iterations => less number of retired instructions (including cmp and jne) => less branches and memory loads.