Tags: java, cpu-architecture, atomic, benchmarking, jmh

Am I interpreting JMH results correctly for a synchronized counter and AtomicInteger?


I wondered how much more performant AtomicInteger is than a synchronized volatile increment in a single-threaded environment, and wrote the following JMH benchmark:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.TimeValue;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class CounterBenchmark {

    @State(Scope.Benchmark)
    public static class ThreadState {

        public final AtomicInteger atomicInteger = new AtomicInteger();
        public final SynchronizedCounter synchronizedCounter = new SynchronizedCounter();
    }

    public static class SynchronizedCounter {
        private volatile int counter = 0;

        public synchronized int incrementAndGet() {
            // Note: counter++ returns the pre-increment value (unlike AtomicInteger.incrementAndGet()).
            return counter++;
        }
    }

    @Benchmark
    public void atomicCounter(ThreadState state, Blackhole blackhole) {
        blackhole.consume(state.atomicInteger.incrementAndGet());
    }

    @Benchmark
    public void syncCounter(ThreadState state, Blackhole blackhole) {
        blackhole.consume(state.synchronizedCounter.incrementAndGet());
    }


    public static void main(String... args) throws Exception {

        Options options = new OptionsBuilder()
                .include(CounterBenchmark.class.getSimpleName())
                .forks(1)
                .threads(1)
                .warmupTime(TimeValue.seconds(1))
                .warmupIterations(5)
                .measurementTime(TimeValue.seconds(1))
                .measurementIterations(5)
                .mode(Mode.Throughput)
                .build();

        new Runner(options).run();
    }
}

And got such output:

Benchmark                        Mode  Cnt       Score      Error   Units
CounterBenchmark.atomicCounter  thrpt    5  145125.272 ± 2064.673  ops/ms
CounterBenchmark.syncCounter    thrpt    5  105763.168 ± 1794.680  ops/ms

The results seem too fast: 145k operations per 1 ms (145 million per second) for AtomicInteger and roughly 105 million per second for synchronized. Am I doing something wrong, or is this real performance?

I have googled things like "common java latency numbers" and so on, but decided to do the benchmarks myself.


Solution

  • You only have one thread so there's no contention for the lock, and not even any time for the cache line to bounce around from the previous owner to the current core doing an RMW because it's always the same core.
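    For contrast, here is a quick sketch of how you could measure the contended case with the same benchmark class: just ask JMH for more than one worker thread, so several threads hammer the same Scope.Benchmark counters. The thread count of 4 below is an arbitrary example, not something from the question:

    Options contended = new OptionsBuilder()
            .include(CounterBenchmark.class.getSimpleName())
            .forks(1)
            .threads(4)   // several worker threads now share the @State(Scope.Benchmark) objects
            .warmupIterations(5)
            .measurementIterations(5)
            .mode(Mode.Throughput)
            .build();
    new Runner(contended).run();

    With Scope.Benchmark state, all worker threads share one AtomicInteger and one lock, so per-thread throughput should drop sharply as the cache line ping-pongs between cores.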

    Uncontended atomic increments (that hit in L1d cache) can run about 1 per 8 or 18 cycles on a current AMD or Intel CPU, respectively. 145.125 Mops / sec * (18 cycles/op) ~= 2612 Mcycles / sec, so a 2.6 GHz Intel Xeon core could do this. I haven't looked for atomic RMW latency / throughput numbers for any ARM cores, but they're probably also pretty good.
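    A back-of-the-envelope sketch of that arithmetic in Java; the 18 cycles/op figure is the assumed Intel uncontended locked-increment cost from above, not something measured in your run:

    public class ImpliedFrequency {
        public static void main(String[] args) {
            double opsPerMs = 145125.272;          // measured JMH score, ops/ms
            double opsPerSec = opsPerMs * 1_000;   // ~145.1 million ops/sec
            double cyclesPerOp = 18;               // assumed cost of an uncontended lock xadd on Intel
            double impliedHz = opsPerSec * cyclesPerOp;
            System.out.printf("implied core clock ~= %.2f GHz%n", impliedHz / 1e9);
            // prints roughly 2.61 GHz, consistent with a ~2.6 GHz server core
        }
    }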

    You didn't say what CPU you tested on; I'm just guessing an Intel from the last 8 years because that leads to a reasonable CPU frequency for a server core; low for a desktop or recent laptop. (Or actual frequency could be higher if my estimate of cycles / iteration is low due to some extra overhead.)

    (You can't add up cycle costs across operations in general, but your loops only involve the one significant operation which alone will be the bottleneck. A volatile increment should JIT to the same asm as AtomicInteger, but you also use synchronized. So IDK why it's not twice as slow. Maybe the JIT can see there's only one thread and avoid doing actual locking with an atomic RMW? Or there's some other overhead in the function, perhaps some stores due to the black-hole, so adding another atomic RMW on a different location (the lock that synchronized uses) wouldn't be twice as slow? In that case my CPU frequency estimate is wrong.)
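    One way to stop guessing is to look at the asm the JIT actually emits for each benchmark method. A sketch of how you could attach JMH's perf-asm profiler to the existing options; this assumes Linux with perf and the hsdis disassembler library installed:

    // requires: import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
    Options withAsm = new OptionsBuilder()
            .include(CounterBenchmark.class.getSimpleName())
            .forks(1)
            .threads(1)
            .addProfiler(LinuxPerfAsmProfiler.class)  // dumps the hottest JIT-compiled asm regions
            .mode(Mode.Throughput)
            .build();
    new Runner(withAsm).run();

    (The same profiler is available from the command line as -prof perfasm; there are dtraceasm / xperfasm equivalents on macOS and Windows.) The output would show whether syncCounter's hot loop still contains a locked CAS for the monitor plus a barrier for the volatile store, or whether the JIT managed to cheapen the uncontended lock.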

    My source for the 8 and 18 cycle numbers is https://uops.info/. Look for lock add or lock xadd in the table. The table shows latency, throughput, and micro-op (uop) counts. Latency is complicated because the operation has 3 inputs (an address, a register input, and the contents of memory) and several outputs (FLAGS, the memory location, and for xadd also the register). uops.info attempts to measure latency from each input to each output with different microbenchmarks, because those can all be different, especially for multi-uop instructions.

    Specifically XADD_LOCK (M32, R32) or ADD_LOCK (M32, I8), since you're using 32-bit integers. (XADD produces the previous value of the memory location, like fetch_add needs. Plain ADD doesn't produce the old value, only FLAGS: whether the result was zero, and so on. Since you're using blackhole.consume on the increment result, the JIT optimizer will have to use something like mov eax, 1 / lock xadd [rdi], eax instead of lock add dword ptr [rdi], 1.)
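    If you want to see that difference from the Java side, here is a hypothetical pair of extra benchmark methods (the names are mine, not from the question) that compare consuming the increment result against discarding it, which is what would let the JIT pick the cheaper form:

    @Benchmark
    public int atomicIncrementConsumed(ThreadState state) {
        // Returning the value makes JMH black-hole it, so the JIT must materialize
        // the RMW result (something like mov eax, 1 / lock xadd [mem], eax).
        return state.atomicInteger.incrementAndGet();
    }

    @Benchmark
    public void atomicIncrementDiscarded(ThreadState state) {
        // The return value is ignored, so the JIT may emit a plain
        // lock add dword ptr [mem], 1 with no register result.
        state.atomicInteger.incrementAndGet();
    }

    On x86 both forms are still full memory barriers, so any throughput difference should be small; the point is mainly to see which instruction shows up in the perf-asm output.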

    The relevant bottleneck in this case is latency from memory to memory, since you're doing increments on the same object. E.g. https://uops.info/html-lat/ZEN4/XADD_LOCK_M32_R32-Measurements.html#lat1-%3E1_mem shows Zen 4's "Latency operand 1 → 1 (memory)" result of 8 cycles for back-to-back increments on the same memory location. (Zen 3 has the same performance for that.)

    (The throughput of a thread doing multiple independent atomic RMWs on different memory locations is surprisingly slightly better than that, despite locked instructions being full memory barriers. But there's not much pipelining so the throughput numbers directly visible in the main table are very close. However, the latency numbers for mem->mem are whole numbers like we'd expect. Intel numbers came out as I'd expect, with throughput exactly equal to latency due to the full memory barrier.)
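    A sketch of how you could probe that same-location dependency chain versus independent locations from Java; the four-counter state class and method names are hypothetical additions to the benchmark, and @OperationsPerInvocation (from org.openjdk.jmh.annotations) keeps the per-increment scores comparable:

    @State(Scope.Benchmark)
    public static class FourCounters {
        public final AtomicInteger a = new AtomicInteger();
        public final AtomicInteger b = new AtomicInteger();
        public final AtomicInteger c = new AtomicInteger();
        public final AtomicInteger d = new AtomicInteger();
    }

    @Benchmark
    @OperationsPerInvocation(4)
    public void sameLocation(ThreadState state, Blackhole bh) {
        // Four dependent RMWs on one location: bound by the mem->mem latency above.
        bh.consume(state.atomicInteger.incrementAndGet());
        bh.consume(state.atomicInteger.incrementAndGet());
        bh.consume(state.atomicInteger.incrementAndGet());
        bh.consume(state.atomicInteger.incrementAndGet());
    }

    @Benchmark
    @OperationsPerInvocation(4)
    public void independentLocations(FourCounters counters, Blackhole bh) {
        // Four RMWs on different locations: measures throughput of independent locked ops.
        bh.consume(counters.a.incrementAndGet());
        bh.consume(counters.b.incrementAndGet());
        bh.consume(counters.c.incrementAndGet());
        bh.consume(counters.d.incrementAndGet());
    }

    On x86 the gap between the two should be small, since each locked instruction is a full barrier; that matches the throughput and latency numbers in the table being so close.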

    Numbers from recent mainstream x86 CPUs for mem->mem latency for lock xadd [mem], r32, which will also be the cycles / iteration for a loop that does an atomic increment on an object with no other memory operations, just some register loop-counter stuff:

    (When I started writing this answer, I wasn't expecting the Zen 1/2 effect of latency being better than throughput. That's interesting. But https://uops.info/ shows the instruction sequences they microbenchmarked with, and the testing is automated and reliable, and reports actual "APERF" cycle counts for each individual test, so the raw data is there. Anyway, not relevant to your test even if you were using an early Ryzen, since the only significant thing in your loop should be the atomic RMW; out-of-order exec of the register operations to keep track of the loop counter isn't part of the bottleneck.)