Search code examples
multithreadingsynchronizationproducer-consumer

Thread synchronization: separate data queue and event queue


Let's consider the classical producer-consumer problem, in which thread A puts messages on queue Q, while thread B pops them from Q. In order for thread B to pop any message from Q, it must be notified by thread A. Let's say that there exists another queue E on which thread B waits for many different events; thread A notifies B about adding a new message on queue Q by pushing event "msg_from_A" on E. This scenario makes sense if you assume that E is not polymorphic and cannot hold variable size elements (such as messages from Q and different messages generated by another thread). When B receives "msg_from_A" it looks into queue Q and pops one element. In my implementation, access to both these queues is protected by mutexes. So B does the following:

while (true) {
  event = wait(E)
  switch(event){
  case "msg_from_A":
    msg = pop(Q)
    do stuff with msg ...
    break
  case "other event":
    ...
  }
}

Whereas A does:

push(msg, Q)
push("msg_from_A", E)

So the question is: is this correct? Is it guaranteed that message msg is pushed on Q before event "msg_from_A" is pushed on E? For if these operations happen in reverse order, there is a race condition: B might wake up before any message is pushed by A on Q, resulting in an empty pop. Although I write the operations in that order, is it possible that the compiler or the CPU reorders them? Please note that queue operations such as push are implemented internally using a mutex, which is acquired before inserting the message and subsequently released once the insertion has completed. I hope my question is clear!


Solution

  • is it possible that the compiler or the CPU reorders them?

    That's the question, Right? If that's your question, then you can prevent the "re-ordering" by correctly using mutexes to protect the two queues.

    Rule 1: Whatever happens within a single thread must effectively happen in "program order." (that is to say, every thread in your program must behave as if its own statements were not re-ordered.)

    Rule 2: Whatever one thread effectively did before it unlocks a mutex must become effective for a second thread by the time the second thread subsequently locks the same mutex.

    If both of your queues are "thread-safe" (i.e., if each queue has its own private mutex, and all of the operations on the queue are protected by the mutex) then it's going to play out like this:

        Producer                         Consumer
        ------------------------------   ------------------------------
                                         (awaiting event from event queue)
        Lock msg queue mutex
        Push message to msg queue
        unlock msg queue mutex
    
        lock event queue mutex
        Push "check-msg-Q" event to
                         event queue
        Notify event queue               Wake, and attempt to...
    (1) unlock event queue mutex
    (2)                                  ...lock event queue mutex
                                         Remove "check-msg-Q" from event queue
                                         unlock event queue mutex
                                         lock message queue mutex
                                         remove message
                                         unlock message queue mutex
    

    According to Rule 1, everything that the producer did in program order has to have effectively happened in that thread by the time it reaches step (1), and according to Rule 2, everything that effectively happened in the producer thread must also have effectively happened for the consumer thread by the time the consumer thread completes step (2).


    You might think that if there was only one producer, and only one consumer, and if the consumer never, ever, even looked at the message queue except to pop one message for each check-msg-queue event it received,... You might think that if all of those conditions were met, your program would work even without any lock on the message queue.

    And it would work, but...

    ...If you pulled that stunt in a big company, and if it somehow got past code review, what happens next is; requirements change, and some other programmer comes along and relaxes one of those restrictions. Maybe they add a second producer. Maybe they make the consumer check the length of the message queue. Whatever. Now the program has a race condition. And eventually—maybe not until a long time later—it will surface as a bug that could take months to find and understand. (Don't ask me how I know!)