c++multithreading communication race-condition

Bidirectional thread communication with multiple condition_variable has rare hang / race condition

I have a rather strange example, and so I will just put the context out here briefly, and we can hopefully just pretend it's a good idea.

I'm using a profiler that requires regular calls to its FRAME() macro so that it knows where CPU frames of a game start and end (the object the macro builds is RAII/scope based). I'm using fibers for my threading (main 'thread' is also a fiber worker), and this profiling macro only supports being called from a thread not registered with the profiler as a fiber worker thread. Ergo, I have this awful solution in the short-term, where I communicate to a separate thread just for this macro. The goal is to get the timing of construction/destruction of the RAII object as accurate as possible on this separate thread without disrupting the calling thread's timing. But sometimes, the entire application hangs. I don't understand how that's possible here.

Main 'thread' (actually on a fiber but that doesn't matter) / game loop:

FrameProfile frameProfile("Client Update");
while (!bShouldQuit)
{
    frameProfile.StartFrame();
    
    /* Do the game client's work for this frame */

    frameProfile.EndFrame();
}

And then this FrameProfile object is responsible for spinning up a separate thread, and will let that thread enter the FRAME macro scope when StartFrame is called from the above, and that thread will sleep in that scope until EndFrame is called, at which point it will wake up and exit the scope, destroying the profiler's frame-measuring object, and giving us a hopefully-accurate frame time.

struct FrameProfile
{
    FrameProfile(const char* tag)
    {
        pthread_ = std::make_unique<std::thread>(
            [tag, this](std::atomic_bool& killFlag) {
                while (!killFlag)
                {
                    assert(!endThreadFrame.WasSignalled());
                    startThreadFrame.WaitConsume();
                    {
                        assert(!startThreadFrame.WasSignalled());
                        assert(!endedThreadFrame.WasSignalled());

                        // Construct the frame-measuring object using this macro
                        OPTICK_FRAME(tag);

                        startedThreadFrame.Signal();

                        endThreadFrame.WaitConsume();
                        // endThreadFrame has been signalled - we need to exit scope
                        // to finish measuring ASAP
                    }
                    assert(!endThreadFrame.WasSignalled());
                    endedThreadFrame.Signal();
                }
            },
            std::ref(bKill_)
        );
    }

    ~FrameProfile()
    {
        bKill_ = true;
        if (pthread_)
        {
            if (pthread_->joinable())
            {
                pthread_->join();
            }
        }
    }

    void StartFrame()
    {
        assert(!startThreadFrame.WasSignalled());
        assert(!startedThreadFrame.WasSignalled());

        // Tell thread to start measuring the frame
        startThreadFrame.Signal();

        // Wait for thread to have started frame measurement
        startedThreadFrame.WaitConsume();
    }
    void EndFrame()
    {
        assert(!endThreadFrame.WasSignalled());
        assert(!endedThreadFrame.WasSignalled());

        // Tell thread to end frame measurement
        endThreadFrame.Signal();

        // Wait for thread to have ended frame measurement
        endedThreadFrame.WaitConsume();
    }


private:
    std::unique_ptr<std::thread> pthread_;
    std::atomic_bool bKill_ = false;

    struct ThreadSignal
    {
        std::atomic_bool bSignalled;
        std::mutex mutex;
        std::condition_variable cv;

        void Signal()
        {
            assert(!bSignalled);
            {
                std::unique_lock<std::mutex> _(mutex);
                bSignalled = true;
            }
            cv.notify_all();
        }

        bool WasSignalled()
        {
            return bSignalled;
        }

        void WaitConsume()
        {
            std::unique_lock unique(mutex);
            cv.wait(unique, [this]() { return bSignalled == true; });
            unique.unlock();
            bSignalled = false;
        }
    };

    ThreadSignal startThreadFrame;
    ThreadSignal endThreadFrame;

    ThreadSignal startedThreadFrame;
    ThreadSignal endedThreadFrame;
};

Can you spot what I'm doing wrong here? Or even a much better solution, I'd be open to it! It's rare but it hangs sometimes - one of the 'ThreadSignal' objects will have its bool as 'true', but will still be stuck - I guess there's a rare timing issue here.

Many thanks! Been tearing my hair out.

Solution

        std::unique_lock unique(mutex);
        cv.wait(unique, [this]() { return bSignalled == true; });
        unique.unlock();
        bSignalled = false;

this is wrong. Move assignment to bSignalled inside lock.

Basically, hands off the condition state outside of the condition mutex. There are a few narrow ways you can prove it legal, but before you do it write a proof and document it, because every such way I have seen legal has been amazingly fragile; the next person who touches your code can break it easily.

Changing that will fix your problem, unless I got it wrong.

Also on many platforms

       assert(!bSignalled);
        {
            std::unique_lock<std::mutex> _(mutex);
            bSignalled = true;
        }
        cv.notify_all();

is less efficient than

       assert(!bSignalled);
        {
            std::unique_lock<std::mutex> _(mutex);
            bSignalled = true;
            cv.notify_all();
        }

as this case is optimized by the OS (it knows the link between cv and mutex). Finally, one consumer means:

            std::unique_lock<std::mutex> _(mutex);
            bSignalled = true;
            cv.notify_one();

is correct. As consumers consume the signal, only one should awake.

   void Signal()
   {
        {
            std::unique_lock<std::mutex> _(mutex);
            bSignalled = true; // 1a
        }
        // 2a
        cv.notify_all(); // 3a
    }

    void WaitConsume()
    {
        std::unique_lock unique(mutex);
        cv.wait(unique, [this]() { return bSignalled == true; }); // 1b
        unique.unlock(); 
        // 2b
        bSignalled = false; //3b
    }

Thread alpha is at 2a.

Thread beta is at 2b.

bSignalled is true, alpha is about to notify all.

Thread beta hits 3b. bSignalled is now false.

Thread alpha hits 3a. It notifies all. Anyone who noticed the notification wakes up and see bSignalled as false. Message is lost.

There are probably other scenarios as well.