Tags: benchmarking, compiler-optimization, inline-assembly, google-benchmark, instruction-reordering

How does Google's `DoNotOptimize()` function enforce statement ordering?


I'm trying to understand exactly how Google's DoNotOptimize() is supposed to work.

For completeness, here is its definition (for clang, and non-const data):

template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
  asm volatile("" : "+r,m"(value) : : "memory");
}

As I understand we can use this in code like this:

start_time = time();
bench_output = run_bench(bench_inputs);
result = time() - start_time;

To ensure that the benchmark stays in the critical section:

start_time = time();
DoNotOptimize(bench_inputs);
bench_output = run_bench(bench_inputs);
DoNotOptimize(bench_output);
result = time() - start_time;

Specifically what I don't understand is why this guarantees (does it?) that run_bench() is not moved above start_time = time().

(Someone asked exactly this in this comment, but I don't understand the answer.)

As I understand it, the above DoNotOptimize() does several things:

  • It forces value to the stack, as it is passed by C++ reference. You can't have a pointer to a register, so it must be in memory.
  • Because value is now on the stack, subsequently clobbering memory (as done in the asm constraints) will force the compiler to assume that value is both read and written by the call to DoNotOptimize(value).
  • (it's not clear to me if the +r,m constraint is relevant. As far as I know this says that the pointer itself may be stored in a register or in memory, but the pointer value itself may be read and/or written.)

And this is where things get fuzzy for me.

If start_time is also stack allocated, the memory clobbering in DoNotOptimize() will mean that the compiler must assume that DoNotOptimize() potentially reads start_time. Therefore the order of the statements can only be:

start_time = time(); // on the stack
DoNotOptimize(bench_inputs); // reads start_time, writes bench_inputs
bench_output = run_bench(bench_inputs);

But if start_time is not stored in memory, but instead in a register, then clobbering memory will not clobber start_time, right? In that case the desired ordering of start_time = time() and DoNotOptimize(bench_inputs) is lost and the compiler is free to do:

DoNotOptimize(bench_inputs); // only writes bench_inputs
bench_output = run_bench(bench_inputs);
start_time = time(); // in a register

Obviously I've misunderstood something. Can anyone help explain? Thanks :)

I'm wondering if this is because reordering optimisations happen prior to register allocation, and thus everything is assumed to be stack allocated at that time. But if that were the case, then DoNotOptimize() would be redundant, as ClobberMemory() would be sufficient.
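As far as I can tell, ClobberMemory() is essentially just the barrier on its own; in older versions of the library (the GCC/clang path) it's defined roughly like this:

inline BENCHMARK_ALWAYS_INLINE void ClobberMemory() {
  asm volatile("" : : : "memory");
}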


Solution

  • Summary: DoNotOptimize is ordered wrt. time() by the "memory" clobbers, as if it were another function call to an opaque function that could modify any global state.

    DoNotOptimize is ordered wrt. the computation of output from input by the data dependency of the calculation on the input, and the output on the calculation, as Chandler Carruth explained in the Q&A you linked. The "memory" clobber is irrelevant for this part.


    "memory" clobber is like a non-inline function call

    DoNotOptimize's asm statement contains a "memory" clobber. As far as the optimizer is concerned, that's equivalent to an opaque function call: it has to be assumed to read and write every globally-reachable object¹. (Even ones this compilation unit might not know about.)
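    To make that concrete, here's a minimal made-up sketch (not from the library) showing how a "memory" clobber pins accesses to a globally-reachable object, exactly as an opaque call would:

    int global_counter;                     // globally reachable, so the clobber applies to it

    void demo() {
        global_counter = 1;                 // can't be dead-store eliminated: the asm might read it
        asm volatile("" ::: "memory");      // treated like a call to an unknown function
        global_counter = 2;                 // can't be moved above the asm, which might have read/written it
    }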

    Since time() itself doesn't have an inline definition in any header, it can't reorder with DoNotOptimize at compile time for the same reason that a compiler can't reorder calls to foo() and bar() when it can't see the definitions of those functions. Same reason compilers don't need any special logic to stop them from reordering puts("hi"); puts("mom");.

    (A hypothetical time() that could inline and only contained an asm statement would have to use asm volatile to make sure repeated calls didn't just use the first one's output. asm volatile statements can't reorder with each other or accesses to volatile variables, so that would be ok too, for a different reason.)
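    As a sketch of that hypothetical (this is not how the real time() is implemented; rdtsc_now is a made-up, x86-only helper), such an inlinable timer would look something like:

    static inline unsigned long long rdtsc_now() {
        unsigned int lo, hi;
        asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // volatile: each call must actually execute
        return ((unsigned long long)hi << 32) | lo;  // TSC is returned in edx:eax
    }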

    Footnote 1: Globally reachable = any object that might be pointed-to by any hypothetical global variable. i.e. anything except local variables within this function, or memory freshly allocated with new, if escape analysis can prove that nothing outside this function could have pointers to them.


    How the asm statement works

    I think you're pretty seriously misunderstanding how the asm works. "+r,m" tells the compiler to materialize the value in a register (or memory if it wants), and then use the value there at the end of the (empty) asm template as the new value of that C++ object.

    So it forces the compiler to actually materialize (produce) the value somewhere, which means it has to be computed. And it means it has to forget what it previously knew about the value (e.g. that it was a compile time constant 5, or non-negative, or anything) because the "+" modifier declares a read/write operand.

    The point of DoNotOptimize on the input is to defeat constant-propagation that would let the benchmark optimize away.

    And on the output to make sure a final result is actually materialized in a register (or memory) instead of optimizing away all the computation leading to an unused result. (This is where being asm volatile is relevant; defeating constant-propagation still works with non-volatile asm.)
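    Putting both uses together, a self-contained sketch of the pattern (run_bench and the constant are made up for illustration; DoNotOptimize is the definition quoted in the question):

    template <class Tp>
    inline __attribute__((always_inline)) void DoNotOptimize(Tp& value) {
        asm volatile("" : "+r,m"(value) : : "memory");
    }

    int run_bench(int x) { return x * x; }   // stand-in for the real work

    void measure_once() {
        int input = 42;
        DoNotOptimize(input);           // compiler must forget input == 42: no constant-folding the work away
        int output = run_bench(input);  // the multiply has to happen for real, between the two barriers
        DoNotOptimize(output);          // result must be materialized, so the work isn't dead-code eliminated
    }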

    So the computation you want to benchmark has to happen between the two DoNotOptimize() statements, and separately those two statements can't reorder with time().

    The compiler has to assume that the asm statement modifies the value like val ^= random for all it knows, along with changing the value in memory of any/every other object except for private locals that weren't operands. So e.g. the "memory" clobber doesn't stop the compiler from keeping a local loop counter in a register. (It doesn't special-case an empty asm template string here; programs don't contain asm statements like this by accident, so nobody wants them optimized away.)
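    For example (a made-up sketch), a private local whose address never escapes can stay in a register across the clobber:

    void spin(long iters) {
        for (long i = 0; i < iters; ++i) {
            asm volatile("" ::: "memory");  // barrier every iteration, but i isn't an operand
        }                                   // and its address never escapes, so i can live in a register
    }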


    Misconceptions about the reference arg and picking "m"

    I only got part way into the details of your attempt to reason about the "+r,m" operand and the reference function-arg before deciding it would probably be better to just explain from scratch. The correct reason isn't that complicated. But a couple things are worth specifically correcting:

    The C++ function containing the asm statement can inline, letting the by-reference function arg optimize away. (It's even declared inline __attribute__((always_inline)) to force inlining even with optimization disabled, although in that case the reference variable won't optimize away.)

    The net result is as if the asm statement were used directly on the C++ variable passed to DoNotOptimize. e.g. DoNotOptimize(foo) is like asm volatile("" : "+r,m"(foo) :: "memory")

    The compiler can always pick register if it wants to, e.g. choosing to load a variable's value into a register before an asm statement. (And if the C++ semantics demand updating the variable's value in memory, also emitting a store instruction after the asm statement.)

    For example, we can see that GCC does choose to do that. (I guess I could have used incl %0 as the example, but I just chose nop as a way to show what the compiler picked for the operand location as an alternative to # %0 pure comment, so the Godbolt compiler explorer wouldn't filter it out.)

    void foo(int *p)
    {
        asm volatile("nop # operand picked %0" : "+r,m" (p[4]) );
    }
    
    # GCC 11.2 -O2
    foo(int*):
            movl    16(%rdi), %eax
            nop # operand picked %eax
            movl    %eax, 16(%rdi)
            ret
    

    vs. clang choosing to leave the value in memory, so every instruction in the asm template would be accessing memory instead of a register. (If there were any instructions).

    # clang 12.0.1 -O2 -fPIE
    foo(int*):                               # @foo(int*)
            nop     # operand picked 16(%rdi)
            retq
    

    Fun fact: "r,m" is an attempt to work around a clang missed-optimization bug that makes it always pick memory for "rm" constraints, even if the value was already in a register: clang spills it first, even if it has to invent a temporary location in memory for the value of an expression used as an input.
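    For concreteness, the two constraint spellings look like this (a minimal sketch; which alternative a given compiler actually picks is exactly the behaviour discussed above):

    void constraint_demo(int x) {
        asm volatile("nop # picked %0" : "+rm"(x));   // single-alternative "rm"
        asm volatile("nop # picked %0" : "+r,m"(x));  // multi-alternative "r,m" workaround
    }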