See the code below: AsyncTask creates a peer thread (timer) that increments an atomic variable and then sleeps for a while. The expected output is to print counter_ 10 times, with values ranging from 1 to 10, but the actual result is strange:
    #include <atomic>
    #include <chrono>   // std::chrono::microseconds
    #include <thread>
    #include <iostream>

    class AtomicTest {
    public:
        int AsyncTask() {
            // Worker thread: increment and print until the stop flag is set.
            std::thread timer([this](){
                while (not stop_.load(std::memory_order_acquire)) {
                    counter_.fetch_add(1, std::memory_order_relaxed);
                    std::cout << "counter = " << counter_ << std::endl;
                    std::this_thread::sleep_for(std::chrono::microseconds(1)); // both milliseconds and seconds work well
                }
            });
            timer.detach();
            // Main thread: wait 10 us, then ask the worker to stop.
            std::this_thread::sleep_for(std::chrono::microseconds(10));
            stop_.store(true, std::memory_order_release);
            return 0;
        }
    private:
        std::atomic<int> counter_{0};
        std::atomic<bool> stop_{false};
    };

    int main(void) {
        AtomicTest test;
        test.AsyncTask();
        return 0;
    }
I know that thread switching also takes time. Is it because the thread sleep time is too short?
My programme running environment:
Yes, it's easily plausible that stop_.store could run before the new thread has been scheduled to a CPU core, or soon after. So its first test reads the stop flag as true.
10 us is shorter than typical OS process-scheduling timeslices (often 1 or 10 ms), in case that's relevant. It's also only a couple of orders of magnitude higher than the inter-core latency for an atomic store becoming visible.
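To get a feel for how large thread startup is compared with that 10 us window on a particular machine, something like the following rough sketch could be used. The helper name measure_thread_start_latency is made up for illustration, and the number it returns will vary a lot between runs, operating systems and hardware.

```cpp
#include <chrono>
#include <thread>

// Hypothetical helper (not from the question): time from just before the
// std::thread is constructed until the first statement of the new thread runs.
std::chrono::nanoseconds measure_thread_start_latency() {
    using steady = std::chrono::steady_clock;
    steady::time_point started;                      // written by the new thread
    const steady::time_point created = steady::now();
    std::thread t([&started] { started = steady::now(); });
    t.join();  // join() synchronizes with the thread's end, so reading `started` is safe
    return std::chrono::duration_cast<std::chrono::nanoseconds>(started - created);
}
```

Calling it a few times and printing `.count()` gives a ballpark figure in nanoseconds. If that figure is close to or above 10000, the worker in the question will often see the stop flag already set on its first check.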
The results you describe are exactly what I'd expect for a timing-dependent program like this, written to detect which thread wins the race and by how much (with its slow << endl and sleep inside the writing thread).
I definitely wouldn't expect it to always print 10 times, and it would be rare for that ever to happen, because thread startup overhead is a significant fraction of the 1 us sleep interval inside the printing thread.
BTW, your question was originally titled "A question about incrementing atomic variables?". But counter_ is only ever accessed from one thread. It's probably in the same cache line as the stop flag, but without contention from the main thread the increment is basically trivial, a very fast operation.
It's irrelevant to what you're doing; it could be a local non-atomic int inside the thread's lambda and you'd see the same timing effects. The significant things here are cout << endl, which forces a flush of the stream (and thus a system call) even if you redirected to a file, and the this_thread::sleep_for().
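As a sketch of that point, here is a variant where the counter is a plain local int inside the lambda and only the stop flag is atomic. The function name run_with_local_counter and its window parameter are made up for illustration, and it joins the thread (rather than detaching) so the count can be returned safely. The number of iterations is still timing-dependent, just like in the original.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical variant of AsyncTask: returns how many times the worker printed
// before it saw the stop flag. The result depends on scheduling and can be 0.
int run_with_local_counter(std::chrono::microseconds window) {
    std::atomic<bool> stop{false};
    int iterations = 0;
    std::thread timer([&] {
        int counter = 0;  // plain local int, only this thread touches it
        while (!stop.load(std::memory_order_acquire)) {
            ++counter;
            std::cout << "counter = " << counter << std::endl;  // flush = system call
            std::this_thread::sleep_for(std::chrono::microseconds(1));
        }
        iterations = counter;
    });
    std::this_thread::sleep_for(window);
    stop.store(true, std::memory_order_release);
    timer.join();  // join() makes the write to `iterations` visible here
    return iterations;
}
```

Running run_with_local_counter(std::chrono::microseconds(10)) repeatedly should show the same kind of run-to-run variation as the atomic version.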
If the write system call was to a terminal (not redirected to a file), it might even block while the terminal emulator drew on the screen, although for only a couple of small writes there's probably a big enough buffer somewhere (probably inside the kernel) to absorb it.
An atomic increment probably takes a few nanoseconds, and being relaxed it's something AArch64 can handle very efficiently, overlapping much of that time with surrounding code. (Modern x86 can do an atomic increment about one per 20 clock cycles at best, and that includes a full memory barrier. I expect Apple M1 to handle it more cheaply when it doesn't need to be a barrier.)
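If you want a rough number for your own machine, a crude single-threaded microbenchmark along these lines would do. The names IncrementTiming and time_relaxed_increments are made up for illustration; it only measures the uncontended case, includes loop overhead, and the result depends on the compiler and optimization level.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical result type: final counter value and average cost per increment.
struct IncrementTiming {
    std::int64_t final_count;
    double ns_per_increment;
};

// Time `n` uncontended relaxed increments on one thread.
IncrementTiming time_relaxed_increments(std::int64_t n) {
    std::atomic<std::int64_t> counter{0};
    const auto begin = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < n; ++i) {
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    const auto end = std::chrono::steady_clock::now();
    const double total_ns =
        std::chrono::duration<double, std::nano>(end - begin).count();
    return {counter.load(std::memory_order_relaxed),
            n > 0 ? total_ns / static_cast<double>(n) : 0.0};
}
```

Comparing ns_per_increment for, say, a million iterations against the 1 us sleep and the cost of a flushed write should make it clear that the increment itself is not what dominates the timing.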