Search code examples
c#.netwindowsmultithreadingthread-safety

Execution stuck on ManualResetEvent.Set after suspending other thread


I've created a tool that finds and resumes suspended threads of a specific process (the reason behind that tool is irrelevant). To test the tool, I've created the following unit test:

// Arrange
var timeout = TimeSpan.FromSeconds(10);
var locker = new object();
var suspendEvent = new ManualResetEventSlim(false);
ulong threadId = 0;

var thread = new Thread(() =>
{
    lock (locker)
    {
        threadId = InteropAPI.TRGetCurrentThreadId();
    }

    Log($"Thread started: {threadId}");
    suspendEvent.Wait();
    Log($"Thread finished: {threadId}");
});
thread.Start();

for (var i = 0; i < 100; i++)
{
    lock (locker)
    {
        if (threadId != 0)
        {
            break;
        }
    }
    Thread.Sleep(10);
}

Assert.That(threadId, Is.Not.EqualTo(0));

// Act
Log("Suspending thread...");
InteropAPI.TRSuspendThread(threadId);
Log("Thread suspended");
suspendEvent.Set();
Log("Event set");

// Assert part omitted

The TRGetCurrentThreadId simply calls winapi GetCurrentThreadId while TRSuspendThread calls SuspendThread.

This version of the test works just fine, executed thousands of times. Here's a slightly cleaner version which tries to get rid of the threadId wait hack:

// Arrange
var timeout = TimeSpan.FromSeconds(10);
var locker = new object();
var suspendEvent = new ManualResetEventSlim(false);
var threadSetEvent = new ManualResetEventSlim(false);
ulong threadId = 0;

var thread = new Thread(() =>
{
    lock (locker)
    {
        threadId = InteropAPI.TRGetCurrentThreadId();
    }
    threadSetEvent.Set();

    Log($"Thread started: {threadId}");
    suspendEvent.Wait();
    Log($"Thread finished: {threadId}");
});
thread.Start();

threadSetEvent.Wait(timeout);
Assert.That(threadId, Is.Not.EqualTo(0));

// Act
Log("Suspending thread...");
InteropAPI.TRSuspendThread(threadId);
Log("Thread suspended");
suspendEvent.Set();
Log("Event set");

The problem with this version is, it gets stuck. Specifically, in the call to suspendEvent.Set(). The logs printed:

Thread started: 37348
Suspending thread...
Thread suspended

The thread gets suspended as expected, as confirmed via process explorer. Forcefully resuming the thread using process explorer makes the test finish and print the rest of the logs:

Event set
Thread finished: 37348

Initially, I thought that perhaps runtime moves the managed, test-running thread, onto a suspended system thread, however printing the TRGetCurrentThreadId in the main thread doesn't confirm that.

The native call stack for the test-running thread, at the time of deadlock, seems to be:

ntdll.dll!NtWaitForAlertByThreadId+0x14
ntdll.dll!TpWorkOnBehalfClearTicket+0x2c3
ntdll.dll!RtlEnterCriticalSection+0x254
ntdll.dll!RtlEnterCriticalSection+0x42

Stack of the suspended thread:

ntdll.dll!ZwCreateEvent+0x14
KERNELBASE.dll!CreateEventW+0x95
  1. Why does the execution stop at suspendEvent.Set()? Does the call to Set() require the waiting thread to not be in a suspended state or otherwise it can deadlock?
  2. Why does the initial version work and the refactored one doesn't? Is it merely due to a subtle timing difference that prevents race from happening?

Remarks:

  • Tests are never run in parallel
  • This test was executed in Unity engine, which uses a custom version of Mono

Solution

  • I think that threadSetEvent.Set() introduces a race between the calls to suspendEvent.Wait() and suspendEvent.Set().

    When your test thread calls suspendEvent.Set();, it tries to acquire the m_lock of the ManualResetEventSlim (source code).

    private void Set(bool duringCancellation) {
        IsSet = true;
    
        if (Waiters > 0) {
            lock (m_lock) {
                Monitor.PulseAll(m_lock);
            }
        }
    

    However, the suspended thread might have already acquired this lock with the call to Wait() before it was suspended:

    lock (m_lock) {
        while (!IsSet) {
            Waiters++;
            if (IsSet) // This check must occur after updating Waiters.
            {
                Waiters--; // revert the increment.
                return true;
            }
    
            try {
                // ** the actual wait **
                if (!Monitor.Wait(m_lock, realMillisecondsTimeout))
                    return false; // return immediately if the timeout has expired.
            } finally {
                // Clean up: we're done waiting.
                Waiters--;
            }
        }
    }
    

    The call to Monitor.Wait() will release the lock, so maybe the thread is suspended before the call to it - causing the dead-lock. When you resume the thread it releases it, so the event can be set by the test thread.

    I think you can't really remove the need for "a hack" in that you might need to non-blockingly wait, but maybe make things a bit cleaner with the use of a Barrier because it's SignalAndWait doesn't take a lock at least on Signal(Set) part.:

    // Arrange
    var timeout = TimeSpan.FromSeconds(10);
    
    var threadSetEvent = new ManualResetEventSlim(false);
    var barrier = new Barrier(2);
    
    uint threadId = 0;
    
    var thread = new Thread(() => {
        threadId = InteropAPI.GetCurrentThreadId();
        threadSetEvent.Set();
    
        Console.WriteLine($"Thread started: {threadId}");
        barrier.SignalAndWait();
        Console.WriteLine($"Thread finished: {threadId}");
    });
    thread.Start();
    
    threadSetEvent.Wait(timeout);
    
    // Act
    Console.WriteLine("Suspending thread...");
    var spinWait = new SpinWait();
    // make sure SignalAndWait() is called from the other thread
    // AND in a Waiting state
    // before we Suspend it
    while (barrier.ParticipantsRemaining != 1) {
    // maybe also add || thread.ThreadState == ThreadState.WaitSleepJoin
        spinWait.SpinOnce();
    }
    InteropAPI.SuspendThreadById(threadId);
    Console.WriteLine("Thread suspended");
    // this unblocks the other thread
    // which when resumed will be able to proceed
    barrier.SignalAndWait();
    Console.WriteLine("Event set");
    InteropAPI.ResumeThreadById(threadId);