Tags: c++, multithreading, atomic, thread-sleep, stdthread

With very short sleep times, why does a thread only finish zero or one iteration of printing before seeing the stop flag set?


See the code below: AsyncTask creates a peer thread (timer) that increments an atomic variable and then sleeps for a while. The expected output is counter_ printed 10 times, with values ranging from 1 to 10, but the actual result is strange:

  • The actual result seems random: sometimes the counter is printed once, sometimes not at all.
  • Further, I found that when I changed the thread sleep time (in both the peer thread and the main thread) to seconds or milliseconds, the program worked as expected.
#include <atomic>
#include <thread>
#include <iostream>

class AtomicTest {
 public:
  int AsyncTask() {
    std::thread timer([this](){
      while (not stop_.load(std::memory_order_acquire)) {
        counter_.fetch_add(1, std::memory_order_relaxed);
        std::cout << "counter = " << counter_ << std::endl;
        std::this_thread::sleep_for(std::chrono::microseconds(1)); // both milliseconds and seconds work well
      }
    });
    timer.detach();

    std::this_thread::sleep_for(std::chrono::microseconds(10));
    stop_.store(true, std::memory_order_release);
    return 0;
  }

 private:
  std::atomic<int> counter_{0};
  std::atomic<bool> stop_{false};
};

int main(void) {
  AtomicTest test;
  test.AsyncTask();
  return 0;
}

I know that thread switching also takes time; is it because the thread sleep time is too short?

My program's runtime environment:

  • Apple clang version 14.0.0 (clang-1400.0.29.202)
  • Target: arm64-apple-darwin22.2.0

Solution

  • Yes, it's easily plausible that stop_.store could run before the new thread has even been scheduled onto a CPU core, or very soon after. So the thread's first test reads the stop flag as true.

    10 µs is shorter than typical OS process-scheduling timeslices (often 1 or 10 ms), in case that's relevant, and only a couple of orders of magnitude longer than the inter-core latency for an atomic store to become visible.

    The results you describe are exactly what I'd expect for a timing-dependent program like this, which effectively detects which thread wins the race and by how much (with its slow << endl and the sleep inside the writing thread).

    I definitely wouldn't expect it to always print 10 times, and it would be rare for that to ever happen, since thread-startup overhead is a significant fraction of the 1 µs sleep interval inside the printing thread.
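If the goal is to see all 10 prints reliably, the race has to go: one option (my own sketch, not the questioner's code; the name run_counter is mine) is to loop a fixed number of times and join() instead of detaching and racing a sleep against the stop flag.

```cpp
#include <atomic>
#include <iostream>
#include <thread>

// Deterministic variant: the worker loops exactly 10 times, and the caller
// joins instead of detaching, so no race decides how many prints happen.
int run_counter() {
  std::atomic<int> counter{0};
  std::thread timer([&counter] {
    for (int i = 0; i < 10; ++i) {
      counter.fetch_add(1, std::memory_order_relaxed);
      std::cout << "counter = " << counter << '\n';
    }
  });
  timer.join();  // wait for all 10 iterations to finish
  return counter.load();
}
```

Calling run_counter() from main prints counter values 1 through 10 every time, since joining also removes the detached-thread hazard of the worker outliving the object it captured.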


    BTW, your question was originally titled "A question about incrementing atomic variables?". But counter_ is only ever accessed from one thread. It's probably in the same cache line as the stop flag, but without contention from the main thread it's basically trivial: a very fast operation.

    It's irrelevant to what you're doing; it could be a local non-atomic int inside the thread's lambda and you'd see the same timing effects. The significant things here are cout << endl, which forces a flush of the stream (and thus a system call) even if you redirect to a file, and the this_thread::sleep_for().
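To illustrate, here's a sketch of that variant (my own code, not the questioner's): the counter is a plain local int, since only the printing thread ever touches it, and the loop's cost profile is unchanged.

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Same loop shape as the original, but with a non-atomic local counter:
// no other thread reads it, so atomicity buys nothing. The expensive parts
// are still the endl flush (a system call) and the sleep.
int run_with_local_counter(std::atomic<bool>& stop) {
  int counter = 0;  // private to this thread, no atomics needed
  while (!stop.load(std::memory_order_acquire)) {
    ++counter;
    std::cout << "counter = " << counter << std::endl;
    std::this_thread::sleep_for(std::chrono::microseconds(1));
  }
  return counter;
}
```

Run it on a thread of its own and the output is just as timing-dependent as the atomic version, which is the point: the atomic increment was never the bottleneck.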

    If the write system call goes to a terminal (not redirected to a file), it might even block while the terminal emulator draws the screen, although for only a couple of small writes there's probably a big enough buffer somewhere (probably inside the kernel) to absorb it.

    An atomic increment probably takes a few nanoseconds, and being relaxed it's something AArch64 can handle very efficiently, overlapping much of that time with surrounding code. (Modern x86 can do an atomic increment about one per 20 clock cycles at best, and that includes a full memory barrier. I expect Apple M1 to handle it more cheaply when it doesn't need to be a barrier.)
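A rough way to see the per-increment cost on your own machine (a sketch of a naive micro-benchmark; absolute numbers vary by CPU and compiler and are only indicative):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

// Times n uncontended relaxed fetch_adds and prints the average cost.
// Expect on the order of nanoseconds per op on a modern core; compare
// that with the microseconds-or-more cost of a stream flush or a sleep.
int count_increments(int n) {
  std::atomic<int> c{0};
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i)
    c.fetch_add(1, std::memory_order_relaxed);
  auto t1 = std::chrono::steady_clock::now();
  double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
  std::printf("%.2f ns per relaxed fetch_add\n", ns / n);
  return c.load();
}
```

Calling count_increments(1'000'000) gives a ballpark figure; a serious measurement would also need to defeat compiler optimizations and warm up the core, but even this crude loop shows the increment is dwarfed by the I/O and the sleep.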