c++ multithreading condition-variable

Worker Thread permanently hibernates, after executing too fast


I am trying to incorporate threads into my project, but I have a problem: even with just 1 worker thread, that thread "falls asleep" permanently. Perhaps I have a race condition, but I just can't spot it.

My PeriodicThreads object maintains a collection of threads. Once PeriodicThreads::exec_threads() has been invoked, the threads are notified, wake up and perform their task. Afterwards, they go back to sleep.

Function of such a worker-thread:

void PeriodicThreads::threadWork(size_t threadId){
    //not really used, but need to declare to use condition_variable:
    std::mutex mutex;
    std::unique_lock<std::mutex> lck(mutex);

    while (true){
        // wait until told to start working on a task:
        while (_thread_shouldWork[threadId] == false){
            _threads_startSignal.wait(lck);
        }

        thread_iteration(threadId);    //virtual function

        _thread_shouldWork[threadId] = false;   //vector of flags
        _thread_doneSignal.notify_all();

    }//end while(true) - run until terminated externally or this whole obj is deleted 
}

As you can see, each thread monitors its own entry in a vector of flags, and once it sees that its flag is true, it performs the task and then resets its flag.

Here is the function that can awaken all the threads:

std::atomic_bool _threadsWorking =false;

//blocks the current thread until all worker threads have completed:
void PeriodicThreads::exec_threads(){
    if(_threadsWorking ){ 
        throw std::runtime_error("you requested exec_threads(), but threads haven't yet finished executing the previous task!");
    }

    _threadsWorking = true;//NOTICE: doing this after the exception check.

    //tell all threads to unpause by setting their flags to 'true'
    std::fill(_thread_shouldWork.begin(),  _thread_shouldWork.end(),  true);
    _threads_startSignal.notify_all();

    //wait for threads to complete:

    std::mutex mutex;
    std::unique_lock<std::mutex> lck(mutex); //lock & mutex are not really used.

    auto isContinueWaiting = [&]()->bool{
        bool threadsWorking = false; 
        for (size_t i=0;  i<_thread_shouldWork.size();  ++i){
            threadsWorking |= _thread_shouldWork[i];
        }
        return threadsWorking;
    };

    while (isContinueWaiting()){
        _thread_doneSignal.wait(lck);
    }

    _threadsWorking = false;//set atomic to false 
}

Invoking exec_threads() works fine for several hundred or, in rare cases, several thousand consecutive iterations. Invocations occur from the main thread's while loop. The worker thread processes the task, resets its flag and goes back to sleep until the next exec_threads(), and so on.

However, some time after that, the program snaps into a "hibernation": it seems to pause, but doesn't crash.

During such a "hibernation", a breakpoint placed in any of the while-loops around my condition_variables is never actually hit.


Being sneaky, I've created my own verify-thread (running in parallel to main) that monitors my PeriodicThreads object. When it falls into hibernation, my verify-thread keeps printing to the console that no threads are currently running (the _threadsWorking atomic of PeriodicThreads is permanently set to false). In other test runs, however, the atomic stays true once the "hibernation issue" begins.

The strange thing is that if I force PeriodicThreads::run_thread to sleep for at least 10 microseconds before resetting its flag, things work as normal and no "hibernation" occurs. Otherwise, if we allow the thread to complete its task very quickly, it might cause this whole issue.

I've wrapped each condition_variable wait inside a while loop to prevent spurious wakeups from triggering a transition, and to handle the situation where notify_all is called before .wait() is called on it. Link

Notice that this occurs even when I have only 1 worker thread.

What could be the cause?

Edit

Abandoning these vector flags and just testing on a single atomic_bool with 1 worker thread still shows the same issue.


Solution

  • All shared data should be protected by a mutex. The mutex should have (at least) the same scope as the shared data.

    Your _thread_shouldWork container is shared data. You can make a global array of mutexes and each one can protect its own _thread_shouldWork element. (see note below). You should also have at least as many condition variables as you have mutexes. (You can use 1 mutex with several different condition variables, but you should not use several different mutexes with 1 condition variable.)

    A condition_variable should guard an actual condition (in this case, the state of an individual element of _thread_shouldWork at any given point), and the mutex is used to protect the variables that make up that condition.

    If you're just using a random local mutex (as you are in your thread code) or just not using a mutex at all (in the main code), then all bets are off. It's undefined behavior, although I could see it working (by luck) most of the time. What I suspect is happening is that a worker thread is missing the signal from the main thread. It could also be that your main thread is missing the signal from a worker thread. (Thread A reads the state and enters the while loop, then Thread B changes the state and sends the notification, then Thread A goes to sleep... waiting for a notification that was already sent.)
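To make the advice above concrete, here is a minimal sketch of the same start/done handshake with one member mutex shared by every thread and two condition variables. It is not the asker's actual class: the name PeriodicWorkers and the members pending_, stopping_ and iterations_ are hypothetical, and the real task (thread_iteration in the question) is replaced by a counter increment. Because each flag is only read or written while the mutex is held, a notification can no longer slip in between a thread checking its flag and going to sleep.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for PeriodicThreads; all names are illustrative.
class PeriodicWorkers {
public:
    explicit PeriodicWorkers(std::size_t workerCount)
        : shouldWork_(workerCount, false), iterations_(workerCount, 0) {
        assert(workerCount > 0);
        for (std::size_t id = 0; id < workerCount; ++id)
            threads_.emplace_back([this, id] { threadWork(id); });
    }

    ~PeriodicWorkers() {
        {
            std::lock_guard<std::mutex> lck(mutex_);
            stopping_ = true;  // written under the same mutex the workers wait on
        }
        startSignal_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Blocks the caller until every worker has finished one iteration.
    void exec_threads() {
        std::unique_lock<std::mutex> lck(mutex_);
        shouldWork_.assign(shouldWork_.size(), true);
        pending_ = shouldWork_.size();
        startSignal_.notify_all();
        // The predicate is evaluated with the mutex held, so a "done"
        // notification cannot be missed between the check and the sleep.
        doneSignal_.wait(lck, [this] { return pending_ == 0; });
    }

    int iterations(std::size_t id) {
        std::lock_guard<std::mutex> lck(mutex_);
        return iterations_[id];
    }

private:
    void threadWork(std::size_t id) {
        std::unique_lock<std::mutex> lck(mutex_);
        while (true) {
            startSignal_.wait(lck, [this, id] { return stopping_ || shouldWork_[id]; });
            if (stopping_) return;

            lck.unlock();
            // The real task would run here, outside the lock.
            lck.lock();

            ++iterations_[id];        // placeholder for the task's visible effect
            shouldWork_[id] = false;  // flag reset under the mutex
            if (--pending_ == 0) doneSignal_.notify_all();
        }
    }

    std::mutex mutex_;  // same scope as the shared data it protects
    std::condition_variable startSignal_;
    std::condition_variable doneSignal_;
    std::vector<bool> shouldWork_;
    std::vector<int> iterations_;
    std::size_t pending_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

A single mutex is used here for simplicity; the per-flag mutexes suggested above would reduce contention but follow the same pattern. The sketch also adds a stop flag so the workers can be joined, which the question's code does not show.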

    Mutexes with local scope are a red flag!

    Note: If you're using a vector, you have to watch out because adding or removing items can trigger a resize which will touch elements without grabbing the mutex first (because of course the vector doesn't know about your mutex).

    You also have to watch out for false sharing when using arrays.

    Edit: Here's a video that @Kari found useful for explaining false sharing https://www.youtube.com/watch?v=dznxqe1Uk3E
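As an illustration of the false-sharing remark, one common mitigation is to pad each per-thread flag so that it occupies its own cache line. The sketch below assumes a 64-byte cache line, which is typical but not guaranteed; PaddedFlag, workerFlags, kWorkerCount and flagFor are hypothetical names that do not appear in the question. Padding only affects performance: the flags still need a mutex or atomics to be accessed safely.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines. Adjust for the target hardware.
constexpr std::size_t kCacheLineBytes = 64;

// Each flag is aligned (and therefore padded) to a full cache line, so two
// threads writing neighbouring flags do not invalidate each other's line.
struct alignas(kCacheLineBytes) PaddedFlag {
    std::atomic<bool> value{false};
};

static_assert(alignof(PaddedFlag) == kCacheLineBytes, "flag must be cache-line aligned");
static_assert(sizeof(PaddedFlag) % kCacheLineBytes == 0, "flag must fill whole cache lines");

// Fixed-size array: unlike a vector, it never reallocates behind your back.
constexpr std::size_t kWorkerCount = 4;
std::array<PaddedFlag, kWorkerCount> workerFlags;

// Returns the flag owned by the given worker.
std::atomic<bool>& flagFor(std::size_t threadId) {
    assert(threadId < kWorkerCount);
    return workerFlags[threadId].value;
}
```

Without the alignas, four std::atomic<bool> objects would normally sit within a few adjacent bytes and share one cache line.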