java caching optimization time micro-optimization

Java Micro-optimization: To cache or not to cache a System.currentTimeMillis() return value?

Simple question that I've been wondering about: of the following two versions of code, which is better optimized? Assume that the time value returned by System.currentTimeMillis() only needs to be roughly accurate, so caching should be considered purely from a performance point of view.

This (with value caching):

    long time = System.currentTimeMillis();
    for (long timestamp : times) {
        if (time - timestamp > 600000L) {
            // Do something
        }
    }

Or this (no caching):

    for (long timestamp : times) {
        if (System.currentTimeMillis() - timestamp > 600000L) {
            // Do something
        }
    }

I'm assuming System.currentTimeMillis() is already a very optimized and lightweight method call, but let's assume I'll be calling it many, many times in a short period.

How many values must the "times" collection/array contain to justify caching the return value of System.currentTimeMillis() in its own variable?

Is this better to do from a CPU or memory optimization point of view?

Solution

A long is basically free. A JVM with a JIT compiler can keep it in a register, and since it's a loop invariant, the JIT can even rewrite your loop condition as -timestamp > 600000L - time or timestamp < time - 600000L, i.e. the loop condition becomes a trivial compare between the current element and a loop-invariant constant in a register.

So yes, it's obviously more efficient to hoist a function call out of a loop and keep the result in a variable, especially when the optimizer can't do that for you, and especially when the result is a primitive type, not an Object.
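The hoisting and loop-condition rewrite described above can be sketched by hand as follows. This is only an illustration: the class name StaleCounter, the method countStale, and the constant MAX_AGE_MS are hypothetical names, and a JIT may well perform the same rewrite on its own once the time is held in a local variable.

```java
final class StaleCounter {
    // Hypothetical threshold: 10 minutes in milliseconds, as in the question.
    static final long MAX_AGE_MS = 600_000L;

    // Counts the timestamps that are more than MAX_AGE_MS older than `now`.
    static int countStale(long[] times, long now) {
        // Loop-invariant cutoff, computed once. Equivalent to
        // (now - timestamp > MAX_AGE_MS) as long as the subtraction does not overflow.
        final long cutoff = now - MAX_AGE_MS;
        int stale = 0;
        for (long timestamp : times) {
            if (timestamp < cutoff) {
                stale++; // stands in for "Do something"
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis(); // read the clock once, outside the loop
        long[] times = { now - 700_000L, now - 1_000L, now - 600_001L, now - 600_000L };
        System.out.println(countStale(times, now)); // prints 2
    }
}
```

Passing the time in as a parameter also makes the loop easy to test with a fixed clock value.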

Assuming your code is running on a JVM that JITs x86 machine code, System.currentTimeMillis() will probably include at least an rdtsc instruction and some scaling of that result¹. So the cheapest it can possibly be (on Skylake for example) is a micro-coded 20-uop instruction with a throughput of one per 25 clock cycles (http://agner.org/optimize/).

If your // Do something is simple, like just a few memory accesses that usually hit in cache, or some simple calculation, or anything else that out-of-order execution can do a good job with, then the System.currentTimeMillis() call could be most of the cost of your loop. Unless each loop iteration typically takes multiple microseconds (i.e. time for thousands of instructions on a 4GHz superscalar CPU), hoisting System.currentTimeMillis() out of the loop can probably make a measurable difference. Whether that difference is small or huge will depend on how simple your loop body is.
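To get a rough feel for that difference on a particular machine, a crude timing sketch like the one below can help. It is not a rigorous benchmark (a harness such as JMH would be more trustworthy), and the class and method names are hypothetical. The array is left filled with zeros so that every entry is far in the past and both loops do the same trivial work.

```java
final class ClockCallCost {
    static final long MAX_AGE_MS = 600_000L;

    // Reads the clock once, before the loop.
    static int countCached(long[] times) {
        long time = System.currentTimeMillis();
        int stale = 0;
        for (long timestamp : times) {
            if (time - timestamp > MAX_AGE_MS) stale++;
        }
        return stale;
    }

    // Reads the clock on every iteration.
    static int countUncached(long[] times) {
        int stale = 0;
        for (long timestamp : times) {
            if (System.currentTimeMillis() - timestamp > MAX_AGE_MS) stale++;
        }
        return stale;
    }

    public static void main(String[] args) {
        long[] times = new long[1_000_000]; // all zeros: every entry counts as stale
        int sink = 0; // keeps the results live so the loops are not optimized away

        // Warm-up so the JIT has a chance to compile both loops.
        for (int i = 0; i < 20; i++) {
            sink += countCached(times) + countUncached(times);
        }

        long t0 = System.nanoTime();
        sink += countCached(times);
        long t1 = System.nanoTime();
        sink += countUncached(times);
        long t2 = System.nanoTime();

        System.out.println("cached:   " + (t1 - t0) + " ns");
        System.out.println("uncached: " + (t2 - t1) + " ns");
        System.out.println("(sink: " + sink + ")");
    }
}
```

The absolute numbers will vary with the JVM, OS, and CPU, so treat them only as an indication of the relative cost.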

If you can prove that hoisting it out of your loop won't cause correctness problems, then go for it.

Even with it inside your loop, your thread could still sleep for an unbounded length of time between calling it and doing the work for that iteration. But hoisting it out of the loop makes it more likely that you could actually observe this kind of effect in practice: running more iterations "too late".
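If staleness over a very long loop is a concern, one possible middle ground (not something the answer above prescribes, just a sketch of a common compromise) is to re-read the clock only every N iterations. The names PeriodicClock, countStale, and refreshEvery are hypothetical, and the clock is passed in as a LongSupplier purely so the behaviour can be checked with a fake clock.

```java
import java.util.function.LongSupplier;

final class PeriodicClock {
    static final long MAX_AGE_MS = 600_000L;

    // Counts stale timestamps, re-reading the clock every `refreshEvery`
    // iterations (refreshEvery must be positive).
    static int countStale(long[] times, LongSupplier clock, int refreshEvery) {
        long time = 0L;
        int stale = 0;
        for (int i = 0; i < times.length; i++) {
            if (i % refreshEvery == 0) {
                time = clock.getAsLong(); // bounded staleness: at most refreshEvery - 1 iterations old
            }
            if (time - times[i] > MAX_AGE_MS) {
                stale++; // stands in for "Do something"
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        long[] times = new long[10_000]; // all zeros: every entry is far in the past
        System.out.println(countStale(times, System::currentTimeMillis, 1024));
    }
}
```

This keeps the clock reads rare while bounding how many iterations can run against an outdated time value. It does nothing about the thread being descheduled between the read and the work.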

Footnote 1: On modern x86, the time-stamp counter runs at a fixed rate, so it's useful as a low-overhead timesource, and less useful for cycle-accurate micro-benchmarking. (Use performance counters for that, or disable turbo / power saving so core clock = reference clock.)

IDK if a JVM would actually go to the trouble of implementing its own time function, though. It might just use an OS-provided time function. On Linux, gettimeofday and clock_gettime are implemented in user-space (with code + scale factor data exported by the kernel into user-space memory, in the VDSO region). So glibc's wrapper just calls that, instead of making an actual syscall.

So clock_gettime can be very cheap compared to an actual system call that switches to kernel mode and back. Such a system call can take at least 1800 clock cycles on Skylake, on a kernel with Spectre + Meltdown mitigation enabled.

So yes, it's hopefully safe to assume System.currentTimeMillis() is "very optimized and lightweight", but even rdtsc itself is expensive compared to some loop bodies.