multithreading pthreads mutex deadlock race-condition

Deadlock free mutexes on directed graph

Suppose I'm writing a multithreaded server. The server maintains a directed graph of nodes, where each node may point to and be pointed to by many others. Clients may add new nodes, add new pointers between nodes, and remove either as well. Reference counting is used for garbage collection. If two clients try to perform an operation that could lead to a race condition, one should be blocked. The frequency of blocking should remain uniform between clients and between nodes.

How do I use the mutex primitive to accomplish this and be free of races and deadlock? It doesn't seem like I can create a simple global right-of-way ranking because that would favor some clients or nodes.

If it changes the answer, it's implemented in C using the POSIX threads interface. Here's one way a node could be defined:

struct node {
    pthread_mutex_t mut;
    int refs; /* reference counting */
    void *content; /* immutable */
    struct node *links[];
}

I've tried just using recursive mutexes to lock each node that needs to be manipulated (supposing that for example a client links a node to itself) but I'm not sure this is correct.

Solution

Answer: very carefully

The ref count will need a separate mutex from the list of links. (Or just use atomic ops on the ref count.) Otherwise you will end up holding two mutexes at once that can deadlock. The mutex for the ref count doesn't "count" as a potential deadlock because if it only modifies the ref count, then it is a end-point - you never grab another lock while holding it, and thus it can't cause a deadlock. Lock, incr, unlock. (or better - atomic incr) Nothing else while holding the ref-count lock.

Before "using" a node you need to increment its ref-count. But before you do that you need to know the node isn't about to be deleted - otherwise the refs variable might be gone. So you need to grab a lock on one of the lists that point to that node - by holding that lock, you know the node can't be completely deleted - someone is referring to it.

But you can't grab a node-list without incrementing a ref-count. Chicken and Egg. ie You are somehow "on" node C, and want to use node D. First lock the list in node C, then it is safe to traverse to D and increment D's ref count. But how did you get "on" to C in the first place?

Well, you started at A, obviously. Or "head" that points to A. ie you need somewhere to start, somewhere to "stand". Maybe a default node that never goes away even.

Anyhow, assuming you can get to node N, and N points to M, to get to M you

lock N.links
incr M.refs (lock refs; inc refs; unlock refs)
unlock N.links
(even if N-to-M now gets unlinked, you've incremented M, so it won't go away)
move to M

Other operations are done similarly.

This, of course, assumes an individual thread is OK with being on a piece of the graph that is actually being disconnected elsewhere at the "same time". ie once you reach M, and use its data or whatever, it may now be disconnected from the graph (but not deleted as you incr'd the refcount). This should be OK, because "same time" has little meaning in threading.