The following pattern is commonplace in software that wants to tell its user how many times it has done various things:
#include <cstdio>

int num_times_done_it; // global, zero-initialised

void doit() {
    ++num_times_done_it;
    // do something
}

void report_stats() {
    printf("called doit %i times\n", num_times_done_it);
    // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it constitute a data race, and hence the entire program's behaviour is undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there is a second data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called, with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:

- volatile: nothing suggests that throwing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it with volatile int num_times_done_it doesn't fix anything.
- Per-thread counters summed in report_stats: this keeps each increment private to a single thread, but it doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and it doesn't really fit Hogwild!'s usage. (A sketch of what this might look like appears after the question below.)

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
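For concreteness, here is roughly what that per-thread-counter non-solution might look like; the registry machinery and all names are invented for illustration, and thread-exit cleanup (part of why this is messy) is ignored:

#include <cstdio>
#include <mutex>
#include <vector>

static std::mutex registry_mutex;
static std::vector<int*> registry; // one slot per thread

thread_local int my_count = 0;
thread_local bool registered = false;

void doit() {
    if (!registered) { // register this thread's counter on first call
        std::lock_guard<std::mutex> g(registry_mutex);
        registry.push_back(&my_count);
        registered = true;
    }
    ++my_count; // uncontended fast path: no race between doit calls
    // do something
}

void report_stats() {
    long total = 0;
    std::lock_guard<std::mutex> g(registry_mutex);
    for (int* c : registry)
        total += *c; // still races with concurrent ++my_count!
    printf("called doit %ld times\n", total);
}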
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
#include <atomic>

std::atomic<int> num_times_done_it(0);

void doit() {
    // relaxed load + relaxed store: neither is a data race, but
    // concurrent increments can be lost
    num_times_done_it.store(1 + num_times_done_it.load(std::memory_order_relaxed),
                            std::memory_order_relaxed);
    // as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page-protection flags, entirely equivalent to the one-instruction memory add and memory inc generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
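(Presumably report_stats must also read through the atomic for its race with doit to go away; a minimal sketch of the reader side, under that assumption:)

#include <atomic>
#include <cstdio>

extern std::atomic<int> num_times_done_it; // defined as above

void report_stats() {
    // a relaxed load gives no ordering guarantees, but it is not a data
    // race, so it returns some value that doit actually stored
    printf("called doit %i times\n",
           num_times_done_it.load(std::memory_order_relaxed));
}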
It seems that the memory_order_relaxed trick is the right way to do this. This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
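For comparison, the conventional idiom is a single relaxed read-modify-write. It is unambiguously race-free and never loses increments under contention, but it reintroduces the lock prefix that the separate load/store above avoids:

#include <atomic>

std::atomic<int> num_times_done_it(0);

void doit() {
    // an atomic RMW: no data race and no lost updates; on x86-64 this
    // compiles to a single `lock add` to the counter's address
    num_times_done_it.fetch_add(1, std::memory_order_relaxed);
    // do something
}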