Search code examples
c++multithreadingmpiinfinite-loopstdthread

While loop in main thread is getting stuck when using std::thread


I have a simple C++ code to test and understand threading. The code has the main thread + a secondary thread. The secondary updates the value of a variable which the main thread loop depends on. When I add a print statement inside the main loop the program finishes successfully, but when I remove this print statement it goes into an infinite loop. This is the code that I'm using, and the print statement that I'm referring to is print statement 2

#include <mpi.h>
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
std::mutex mu;
int num;
using namespace std;

void WorkerFunction()
{
    bool work = true;
    while(work)
    {
            mu.lock();
            num --;
            mu.unlock();

            if(num == 1)
               work = false;
    }
}


int main(int argc, char **argv)
{
    bool work = true;
    num = 10;
    int numRanks, myRank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &numRanks);
    MPI_Comm_rank(MPI_COMM_WORLD, &myRank);

    std::thread workThread (WorkerFunction);
    //print statement 1
    cerr<<"Rank "<<myRank<<" Started workThread \n";

     int mult = 0;
     while(work)
     {
          mult += mult * num;
         //print statement 2
         if(myRank == 0) cerr<<"num = "<<num<<"\n";
         if(num == 1)
           work = false;
      }
   if(work == false)
      workThread.join();

   //print statement 3
   cerr<<"Rank "<<myRank<<" Done with both threads \n";

   MPI_Finalize();

 };

This is the output I get when I have print statement 2

mpirun -np 4 ./Testing
Rank 0 Started workThread 
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
num = 10
Rank 1 Started workThread 
Rank 0 Done with both threads 
Rank 1 Done with both threads 
Rank 2 Started workThread 
Rank 3 Started workThread 
Rank 2 Done with both threads 
Rank 3 Done with both threads

If I comment out that print statement then it goes into an infinte loop and this is the output I get

mpirun -np 4 ./Testing
Rank 0 Started workThread 
Rank 0 Done with both threads 
Rank 1 Started workThread 
Rank 2 Started workThread 
Rank 3 Started workThread 
Rank 2 Done with both threads 
Rank 3 Done with both threads

I'm not sure what am I doing wrong, any help is appreciated.


Solution

  • Concerning MPI, I haven't any experience. (I used it decades ago, and I'm sure that fact is completely worthless.) However, OP claimed

    I have a simple C++ code to test and understand threading.

    Considering, that multiprocessing (with MPI) as well as multithreading (with std::thread) are complicated topics on its own, I would separate the topics first, and try to put them together after having gained some experience in each of them.

    So, I elaborate a bit about the multithreading (which I feel able to).


    First sample is a revised version of OPs code (all references to MPI removed):

    #include <iostream>
    #include <thread>
    #include <mutex>
    #include <chrono>
    
    std::mutex mtxNum;
    int num;
    
    const std::chrono::milliseconds delay(100);
    
    void WorkerFunction()
    {
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        int num_;
        mtxNum.lock();
        num_ = --num;
        mtxNum.unlock();
        work = num_ != 1;
      }
    }
    
    int main()
    {
      num = 10;
      std::thread workThread(&WorkerFunction);
      int mult = 0;
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        int num_;
        mtxNum.lock();
        num_ = num;
        mtxNum.unlock();
        std::cout << "num: " << num_ << '\n';
        mult += mult * num_;
        work = num_ != 1;
      }
      if (workThread.joinable()) workThread.join();
      std::cout << "Both threads done.\n";
    }
    

    Output:

    num: 10
    num: 8
    num: 7
    num: 6
    num: 5
    num: 4
    num: 3
    num: 2
    num: 2
    num: 1
    Both threads done.
    

    Live Demo on coliru

    Notes:

    1. While multithreading is running, and variable num is shared, and variable num is modified in at least one thread, every access should be put into a critical section (a pair of mutex lock and unlock).

    2. The critical section should always be kept as short as possible. (Only one thread can pass the critical section at one time. Hence, it introduces re-serialization which consumes speed-up intended by concurrency.) I introduced a local variable num_ in each thread to copy current value of shared variable and to use it after critical section in the respective thread.*

    3. I added a sleep_for() to both threads for better illustration. Without, I got

      num: 10
      num: 1
      Both threads done.
      

      which I found somehow boring.

    4. The output skips num == 9 and prints num == 2 twice. (This may look differently in other runs.) The reason is that the threads work asynchronously by definition. (The equal delay of 100 milliseconds in both threads is no reliable synchronization.) The OS is responsible to wake a thread if nothing (like e.g. the locked mutex) prevents this. It is free to suspend the thread at any time.

    Concerning mtxNum.lock()/mtxNum.unlock(): Imagine that the critical section contains something more complicated than a simple --num; which may throw an exception. If an exception is thrown, the mtxNum.unlock() is skipped, and a deadlock is produced preventing any thread to proceed.

    For this, the std library provides a nice and handy tool: std::lock_guard:

    #include <iostream>
    #include <thread>
    #include <mutex>
    #include <chrono>
    
    std::mutex mtxNum;
    int num;
    
    const std::chrono::milliseconds delay(100);
    
    void WorkerFunction()
    {
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        int num_;
        { std::lock_guard<std::mutex> lock(mtxNum); // does the mtxNum.lock()
          num_ = --num;
        } // destructor of lock does the mtxNum.unlock()
        work = num_ != 1;
      }
    }
    
    int main()
    {
      num = 10;
      std::thread workThread(&WorkerFunction);
      int mult = 0;
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        int num_;
        { std::lock_guard<std::mutex> lock(mtxNum); // does the mtxNum.lock()
          num_ = num;
        } // destructor of lock does the mtxNum.unlock()
        std::cout << "num: " << num_ << '\n';
        mult += mult * num_;
        work = num_ != 1;
      }
      if (workThread.joinable()) workThread.join();
      std::cout << "Both threads done.\n";
    }
    

    Output:

    num: 10
    num: 8
    num: 7
    num: 6
    num: 5
    num: 4
    num: 3
    num: 2
    num: 1
    Both threads done.
    

    Live Demo on coliru

    The trick with std::lock_guard is that the destructor unlocks the mutex in any case, even if an exception is thrown inside of critical section.

    May be, I'm a bit paranoid but it annoys me that non-guarded access to a shared variable may happen by accident without being noticed in any debugging session nor any compiler diagnostics.** Hence, it might be worth to hide the shared variable into a class where access is only possible with locking it. For this, I introduced Shared in the sample:

    #include <iostream>
    #include <thread>
    #include <mutex>
    #include <chrono>
    
    template <typename T>
    class Shared {
      public:
        struct Lock {
          Shared &shared;
          std::lock_guard<std::mutex> lock;
          Lock(Shared &shared): shared(shared), lock(shared._mtx) { }
          ~Lock() = default;
          Lock(const Lock&) = delete;
          Lock& operator=(const Lock&) = delete;
    
          const T& get() const { return shared._value; }
          T& get() { return shared._value; }
        };
      private:
        std::mutex _mtx;
        T _value;
      public:
        Shared() = default;
        explicit Shared(T &&value): _value(std::move(value)) { }
        ~Shared() = default;
        Shared(const Shared&) = delete;
        Shared& operator=(const Shared&) = delete;
    };
    
    typedef Shared<int> SharedInt;
    SharedInt shNum(10);
    
    const std::chrono::milliseconds delay(100);
    
    void WorkerFunction()
    {
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        int num_;
        { SharedInt::Lock lock(shNum);
          num_ = --lock.get();
        }
        work = num_ != 1;
      }
    }
    
    int main()
    {
      std::thread workThread(&WorkerFunction);
      int mult = 0;
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        int num_;
        { const SharedInt::Lock lock(shNum);
          num_ = lock.get();
        }
        std::cout << "num: " << num_ << '\n';
        mult += mult * num_;
        work = num_ != 1;
      }
      if (workThread.joinable()) workThread.join();
      std::cout << "Both threads done.\n";
    }
    

    Output: similar as before.

    Live Demo on coliru

    The trick is that a reference to shared value can be retrieved from a Shared::Lock instance → i.e. while it is locked. Even if the reference is stored:

        { SharedInt::Lock lock(shNum);
          int &num = lock.get();
          num_ = --num;
        }
    

    The lifetime of int &num just ends before the lifetime of SharedInt::Lock lock(shNum);.

    Of course, one could get a pointer to num to use it outside of scope but I would consider this as sabotage.


    Another thing, I would like to mention is std::atomic:

    The atomic library provides components for fine-grained atomic operations allowing for lockless concurrent programming. Each atomic operation is indivisible with regards to any other atomic operation that involves the same object.

    While a mutex may be subject of OS kernel functions, an atomic access might be done exploiting CPU features without the necessity to enter the kernel. (This might provide speed-up as well as result in less usage of OS resources.)

    Even better, if there is no H/W support for the resp. type available it falls back to an implementation based on mutexes or other locking operations (according to the Notes in std::atomic<T>::is_lock_free()):

    All atomic types except for std::atomic_flag may be implemented using mutexes or other locking operations, rather than using the lock-free atomic CPU instructions. Atomic types are also allowed to be sometimes lock-free, e.g. if only aligned memory accesses are naturally atomic on a given architecture, misaligned objects of the same type have to use locks.

    The modified sample with std::atomic:

    #include <iostream>
    #include <thread>
    #include <atomic>
    #include <chrono>
    
    std::atomic<int> num;
    
    const std::chrono::milliseconds delay(100);
    
    void WorkerFunction()
    {
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        work = --num != 1;
      }
    }
    
    int main()
    {
      num = 10;
      std::thread workThread(&WorkerFunction);
      int mult = 0;
      for (bool work = true; work; std::this_thread::sleep_for(delay)) {
        const int num_ = num;
        std::cout << "num: " << num_ << '\n';
        mult += mult * num_;
        work = num_ != 1;
      }
      if (workThread.joinable()) workThread.join();
      std::cout << "Both threads done.\n";
    }
    

    Output:

    num: 10
    num: 8
    num: 7
    num: 7
    num: 5
    num: 4
    num: 3
    num: 3
    num: 1
    Both threads done.
    

    Live Demo on coliru


    * I brooded a while over the WorkingThread(). If it's the only thread which modifies num, the read access to num (in WorkingThread()) outside critical section should be safe – I believe. However, at least, for the sake of maintainability I wouldn't do so.

    ** According to my personal experience, such errors occur rarely (or never) in debug sessions but in the first 180 seconds of a presentation to a customer.