ruby, multithreading, thread-safety, signals, deadlock

Simple thread condition variable example giving deadlock in Ruby


Hi, I am toying around with threads and condition variables in Ruby and I'm getting some very confusing results. I am following the ConditionVariable example from the Ruby docs, and at first everything seems to go as planned:

mutex = Mutex.new
resource = ConditionVariable.new

waiting_thread = Thread.new {
  mutex.synchronize {
    puts "Thread 'a' now needs the resource"
    resource.wait(mutex)
    puts "'a' can now have the resource"
    "a can now have the resource"
  }
}

signal_thread = Thread.new {
  mutex.synchronize {
    puts "Thread 'b' has finished using the resource"
    resource.signal
  }
}

When I run this code, I get the expected output:

=> Thread 'a' now needs the resource
=> Thread 'b' has finished using the resource
=> 'a' can now have the resource

However, the MOMENT I change it up a bit to join or get the value from waiting_thread, it blows up with a fatal deadlock error.

waiting_thread.value
signal_thread

Outputs:

=> Failure/Error: waiting_thread.value -- No live threads left. Deadlock?

I can vaguely understand what is happening: both threads are trying to synchronize on the same mutex while waiting_thread is blocked indefinitely.

But in that case, why does the initial code work flawlessly, printing the puts output in the expected order?

This is somewhat important not only for my understanding but also for toying around with concurrent testing. How can I use join and value with a ConditionVariable to produce what I'm looking for?


Solution

  • I think the code from the Ruby docs can be a bit misleading, because it does not tell you that a signal is not buffered anywhere: if no thread is waiting when the signal is sent, it is simply lost.

    The deadlock happens as follows:

    1. signal_thread enters critical section and calls resource.signal. This signal will be lost.

    2. signal_thread is done, and exits.

    3. waiting_thread enters critical section and calls resource.wait. It's now blocked waiting for a signal that never comes.

    4. All threads are blocked or finished. No more live threads, therefore no one is able to wake up waiting_thread --> deadlock error.
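
The steps above can be reproduced on demand with a rough sketch like the one below. It forces signal_thread to finish first by joining it, and gives wait a timeout so the demo ends instead of raising the deadlock error. The helper name lost_signal_demo and the 0.2 second timeout are made up for illustration; they are not part of the docs example.

```ruby
# Hypothetical helper: forces signal_thread to run before waiting_thread.
def lost_signal_demo(timeout = 0.2)
  mutex = Mutex.new
  resource = ConditionVariable.new

  # Steps 1 and 2: signal while nobody is waiting, then exit.
  signal_thread = Thread.new { mutex.synchronize { resource.signal } }
  signal_thread.join

  # Step 3: wait for a signal that has already been dropped.
  # The timeout (an assumption of this sketch) keeps the thread from blocking forever.
  waiting_thread = Thread.new do
    mutex.synchronize do
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      resource.wait(mutex, timeout)
      waited = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      # If the wait lasted roughly the whole timeout, no signal woke us up.
      waited >= timeout * 0.75 ? :timed_out : :signaled
    end
  end
  waiting_thread.value
end

puts lost_signal_demo
```

If the signal had been buffered, the wait would return immediately. Instead it should run into the timeout, which is the same situation that ends in the deadlock error when no timeout is given.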

    You can hit this deadlock with the sample code depending on your CPU, OS, and the position of the sun or moon, if you just keep running it, because the order in which signal_thread and waiting_thread execute is not deterministic. If signal_thread happens to run first, the signal is lost and waiting_thread.value deadlocks; if waiting_thread runs first, everything works.

    Now how do you solve it? You need to guarantee that waiting_thread is waiting before signal_thread signals. We can do this using a Queue, like so:

    mutex = Mutex.new
    resource = ConditionVariable.new
    
    sync_queue = Queue.new
    
    waiting_thread = Thread.new {
      mutex.synchronize {
        puts "Waiting thread sending sync message..."
        sync_queue << 1
    
        puts "Thread 'a' now needs the resource"
        resource.wait(mutex)
        puts "'a' can now have the resource"
        "a can now have the resource"
      }
    }
    
    signal_thread = Thread.new {
      puts "Signal thread waiting for sync..."
      # signal_thread will sleep here, until there is something in the queue to pop.
      # This guarantees the right execution order. 
      sync_queue.pop
    
      mutex.synchronize {
        puts "Thread 'b' has finished using the resource"
        resource.signal
      }
    }
    
    waiting_thread.value
    

    Now the order is deterministic. waiting_thread pushes to the queue while it holds the mutex, so signal_thread cannot enter its critical section until resource.wait releases the mutex. waiting_thread therefore always waits before signal_thread signals, and the code works as expected.

    You just have to be aware that the signal call of a condition variable goes up in smoke if nobody is waiting on the other end. I think this important information is missing from the docs.

    In addition, the resource example is not a very good example of checking whether a resource is available in a critical section, precisely because of this problem. If signal_thread has already finished using the resource, waiting_thread will never know it.

    In a real situation there needs to be additional data shared between the threads, so that a thread can first check whether the resource is in use, and only THEN wait for a signal. If the resource is not in use, waiting for the signal is not needed, and in fact should not be done at all.

    In other words, the ConditionVariable should not be used for tracking resource state, only for signaling. Used that way, condition variables are being used more appropriately.
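
A minimal sketch of that idea, assuming a shared boolean flag guarded by the same mutex. The names resource_free and run_with_flag are hypothetical and only serve the illustration. The waiter checks the flag before waiting and re-checks it after every wakeup, so it does not matter which thread runs first.

```ruby
# Hypothetical helper: runs the waiter and signaler in either order.
def run_with_flag(signal_first:)
  mutex = Mutex.new
  resource = ConditionVariable.new
  resource_free = false # shared state, only read or written while holding the mutex

  start_signaler = lambda do
    Thread.new do
      mutex.synchronize do
        resource_free = true # record the state first...
        resource.signal      # ...then wake a waiter, if there is one
      end
    end
  end

  start_waiter = lambda do
    Thread.new do
      mutex.synchronize do
        # Wait only while the resource is still in use; skip the wait otherwise.
        resource.wait(mutex) until resource_free
        "a can now have the resource"
      end
    end
  end

  if signal_first
    start_signaler.call.join # the signal is sent with nobody waiting
    start_waiter.call.value  # the flag is already set, so no wait happens
  else
    waiting_thread = start_waiter.call
    start_signaler.call.join
    waiting_thread.value
  end
end

puts run_with_flag(signal_first: true)
puts run_with_flag(signal_first: false)
```

Because the state lives in the flag and not in the signal, a signal sent too early does no harm, and join and value can be called without risking the deadlock error.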