The following pattern is commonplace in software that wants to tell its user how many times it has done various things:
#include <cstdio>

int num_times_done_it; // global, zero-initialised

void doit() {
    ++num_times_done_it;
    // do something
}

void report_stats() {
    printf("called doit %i times\n", num_times_done_it);
    // and probably some other stuff too
}
Unfortunately, if multiple threads can call doit without some sort of synchronisation, the concurrent read-modify-writes to num_times_done_it constitute a data race, and hence the entire program's behaviour is undefined. Further, if report_stats can be called concurrently with doit absent any synchronisation, there is a second data race between the thread modifying num_times_done_it and the thread reporting its value.
Often, the programmer just wants a mostly-right count of the number of times doit has been called, with as little overhead as possible.
(If you consider this example trivial, Hogwild! gains a significant speed advantage over a data-race-free stochastic gradient descent using essentially this trick. Also, I believe the Hotspot JVM does exactly this sort of unguarded, multithreaded access to a shared counter for method invocation counts---though it's in the clear since it generates assembly code instead of C++11.)
Apparent non-solutions:

- volatile: nothing suggests that throwing volatile into the mix makes data races OK, so replacing the declaration of num_times_done_it with volatile int num_times_done_it doesn't fix anything.
- Per-thread counters summed in report_stats: this keeps each increment private to a single thread, but it doesn't solve the data race between doit and report_stats. Also, it's messy, it assumes the updates are associative, and it doesn't really fit Hogwild!'s usage. (A sketch of what this might look like appears after the question below.)

Is it possible to implement invocation counters with well-defined semantics in a nontrivial, multithreaded C++11 program without some form of synchronisation?
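For concreteness, here is roughly what that per-thread-counter non-solution might look like; the registry machinery and all names are invented for illustration, and thread-exit cleanup (part of why this is messy) is ignored:

#include <cstdio>
#include <mutex>
#include <vector>

static std::mutex registry_mutex;
static std::vector<int*> registry; // one slot per thread

thread_local int my_count = 0;
thread_local bool registered = false;

void doit() {
    if (!registered) { // register this thread's counter on first call
        std::lock_guard<std::mutex> g(registry_mutex);
        registry.push_back(&my_count);
        registered = true;
    }
    ++my_count; // uncontended fast path: no race between doit calls
    // do something
}

void report_stats() {
    long total = 0;
    std::lock_guard<std::mutex> g(registry_mutex);
    for (int* c : registry)
        total += *c; // still races with concurrent ++my_count!
    printf("called doit %ld times\n", total);
}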
EDIT: It seems that we can do this in a slightly indirect way using memory_order_relaxed:
#include <atomic>

std::atomic<int> num_times_done_it(0);

void doit() {
    // relaxed load + relaxed store: neither is a data race, but
    // concurrent increments can be lost
    num_times_done_it.store(1 + num_times_done_it.load(std::memory_order_relaxed),
                            std::memory_order_relaxed);
    // as before
}
However, gcc 4.8.2 generates this code on x86_64 (with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: 83 c0 01 add $0x1,%eax
9: 89 05 00 00 00 00 mov %eax,0x0(%rip)
and clang 3.4 generates this code on x86_64 (again with -O3):
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
6: ff c0 inc %eax
8: 89 05 00 00 00 00 mov %eax,0x0(%rip)
My understanding of x86-TSO is that both of these code sequences are, barring interrupts and funny page-protection flags, entirely equivalent to the one-instruction memory add and memory inc generated by the straightforward code. Does this use of memory_order_relaxed constitute a data race?
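(Presumably report_stats must also read through the atomic for its race with doit to go away; a minimal sketch of the reader side, under that assumption:)

#include <atomic>
#include <cstdio>

extern std::atomic<int> num_times_done_it; // defined as above

void report_stats() {
    // a relaxed load gives no ordering guarantees, but it is not a data
    // race, so it returns some value that doit actually stored
    printf("called doit %i times\n",
           num_times_done_it.load(std::memory_order_relaxed));
}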
It seems that the memory_order_relaxed trick is the right way to do this. This blog post by Dmitry Vyukov at Intel begins by answering exactly my question, and proceeds to list the memory_order_relaxed store and load as the proper alternative.

I am still unsure of whether this is really OK; in particular, N3710 makes me doubt that I ever understood memory_order_relaxed in the first place.
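For comparison, the conventional idiom is a single relaxed read-modify-write. It is unambiguously race-free and never loses increments under contention, but it reintroduces the lock prefix that the separate load/store above avoids:

#include <atomic>

std::atomic<int> num_times_done_it(0);

void doit() {
    // an atomic RMW: no data race and no lost updates; on x86-64 this
    // compiles to a single `lock add` to the counter's address
    num_times_done_it.fetch_add(1, std::memory_order_relaxed);
    // do something
}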