Search code examples
unit-testinglanguage-agnosticdeadlockverification

Sensible strategy for unit testing expected and non-expected deadlock behavior


I'd like some ideas about how I should test some objects that can block, waiting for another participant. The specific unit to be tested is the channel between the participants, The the participants themselves are mock fixtures for the purposes of the tests.

It would be nice to validate that the participants do deadlock when they are expected to, but this is not terribly important to me, since what happens after the deadlock can reasonably be described as undefined.

More critical would be to verify that the defined interactions from the participants do not deadlock.

In either case, I'm not really sure what the optimal testing strategy should be. My current notion is to have the test runner fire off a thread for each participant, sleep for a while, then discover if the child threads have returned. In the case they have not returned in time, assume that they have deadlocked, and safely terminate the threads, and the test fails (or succeeds if the deadlock was expected).

This feels a bit probabalistic, since there could be all sorts of reasons (however unlikely) that a thread might take longer than expected to complete. Are there any other, good ways of approaching this problem?

EDIT: I'm sure a soundness in testing would be nice, but I don't think I need to have it. I'm thinking in terms of three levels of testing certainty.

  • "The actual behavior has proven to match the expected behavior" deadlock cannot occur
  • "The actual behavior matched the expected behavior" deadlock did not occur in N tests
  • "The actual behavior agrees with the expected behavior" N tests completed within expected deadline

The first of course is a valuable test to pass, but ShiDoiSi's answer speaks to the impracticality of that. The second one is significantly weaker than the first, but still hard; How can you establish that a network of processes has actually deadlocked? I'm not sure that's any easier to prove than the first (maybe a lot harder)

The last one is more like what I have in mind.


Solution

  • The only way to reliably test for deadlocks is to instrument the locking subsystem to detect and report them. The last time I had to do this, we built a debug version of it that recorded which threads held which locks and checked for potential deadlocks on every lock-obtain call. It can be a heavyweight operation in a system with a lot of locking going on, but we found it to be so valuable that we reorganized the subsystem so we could turn it on and off with a switch at runtime, even in production builds.