I am writing a server system that runs in the background. In very simplified terms, it has its own scripting language, which means that a process can be written in that language to run on its own, call another process, and so on. I am converting this system from a trivial PHP cron job, in which only one instance is permitted at a time, to a set of long-running processes managed by Supervisor.
With that in mind, I am aware that these processes can be killed at any time, either by myself in development, or perhaps by Supervisord in the normal course of stopping or restarting a worker. I would like to add some proper signal handling to ensure that workers tidy up after themselves, and log where a task was left in an interrupted state where appropriate.
I have worked out how to enable signal handling using ticks and `pcntl_signal()`, and my handling currently seems to work OK. However, I would like to test this to make sure it is reliable. I have written some early integration tests, but they don't feel all that solid, mainly because during development there were all sorts of weird race-condition issues that were tricky to pin down.
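For context, a minimal sketch of the tick-based approach might look like this, assuming the pcntl extension is loaded; the flag name and loop body are illustrative rather than my actual worker code:

```php
<?php
// Minimal sketch of a tick-based signal handler (requires ext-pcntl).
// The $shutdownRequested flag and loop body are illustrative only.
declare(ticks=1);

$shutdownRequested = false;

$handler = function ($signal) use (&$shutdownRequested) {
    // Record the interruption; the main loop tidies up on its next pass.
    $shutdownRequested = true;
};

pcntl_signal(SIGTERM, $handler);
pcntl_signal(SIGINT, $handler);

while (!$shutdownRequested) {
    // ... perform one unit of work ...
    usleep(100000);
}

// Tidy up here: release locks, mark the task as interrupted, write a log entry.
```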
I'd like some advice or direction on how to send kill signals in PHPUnit tests, with a view to improving confidence that my signal handling is robust. My present strategy (sketched in code below) is:

- Running a `system()` command in the PHPUnit test. My command is similar to `php script.php > $logFile 2>&1 &`, i.e. redirect all output to a log file and then push it to the background, so the test method can monitor it
- Waiting for the worker to write its PID to a temp file, `usleep`ing between scans
- Waiting for the database to enter the correct state, `usleep`ing between scans, and issuing a `kill <pid>` when it is ready
- Checking for the expected post-kill state, `usleep`ing again to avoid hammering the database

Of course, with all this waiting/checking, it feels a bit ropey, and quite ripe for race conditions of all sorts. My current feeling is that the tests will fail around 2% of the time, but I've not been able to get the test to fail for a day or so. I plan to do some soak testing, and if I get any failures from that I'll post an update here.
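To make the strategy concrete, here is a hedged sketch of such a harness; the file names, the log marker, and the timeout are illustrative assumptions rather than my real code:

```php
<?php
// Hypothetical sketch of the launch/poll/kill harness described above.
// File names, log markers, and timeouts are illustrative assumptions.
use PHPUnit\Framework\TestCase;

class SignalHandlingTest extends TestCase
{
    public function testWorkerTidiesUpOnSigterm()
    {
        $logFile = sys_get_temp_dir() . '/worker.log';
        $pidFile = sys_get_temp_dir() . '/worker.pid';
        @unlink($logFile);
        @unlink($pidFile);

        // Launch the worker in the background, all output to the log file.
        system(sprintf('php script.php > %s 2>&1 &', escapeshellarg($logFile)));

        // Wait-check loop 1: poll until the worker has written its PID
        // (guard against the file existing but being empty at first).
        $pid = (int) $this->waitFor(function () use ($pidFile) {
            $contents = is_file($pidFile) ? trim(file_get_contents($pidFile)) : '';
            return $contents !== '' ? $contents : null;
        });

        // (Wait-check loop 2 would poll the database here, until the task
        // is in the right state to be interrupted.)

        // Send the signal under test (requires ext-posix; SIGTERM comes from ext-pcntl).
        posix_kill($pid, SIGTERM);

        // Wait-check loop 3: poll until the tidy-up entry appears in the log.
        $this->waitFor(function () use ($logFile) {
            $log = is_file($logFile) ? file_get_contents($logFile) : '';
            return strpos($log, 'task interrupted') !== false ? true : null;
        });
    }

    /** Polls $probe until it returns non-null, usleep-ing between scans. */
    private function waitFor(callable $probe, $timeoutMs = 5000)
    {
        $deadline = microtime(true) + $timeoutMs / 1000;
        while (microtime(true) < $deadline) {
            $result = $probe();
            if ($result !== null) {
                return $result;
            }
            usleep(100000); // 100 ms between scans
        }
        $this->fail('Timed out waiting for the expected state');
    }
}
```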
I wonder if I can simplify it by asking the system under test to `kill` itself, which would remove two levels of wait-checking (one to wait for the PID, and another to wait for the database to enter the correct state before the kill command)†. That would still leave the wait-check loop after the kill is issued, but I may yet find that having that one check is not a problem in practice.
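To illustrate, the launch side of such a test might shrink to something like this; the `TEST_SELF_KILL` environment switch is hypothetical:

```php
<?php
// Sketch of the simplified launch once the worker can signal itself.
// The TEST_SELF_KILL environment switch is an assumption for illustration.
$logFile = sys_get_temp_dir() . '/worker.log';
system(sprintf(
    'TEST_SELF_KILL=1 php script.php > %s 2>&1 &',
    escapeshellarg($logFile)
));

// The PID wait and the pre-kill database wait disappear; the test keeps
// only its post-kill wait-check loop (but see the footnote below).
```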
That said, I am conscious that my whole approach may be ham-fisted, and that there may be a better way to do this sort of thing. Any ideas? At present my thinking is just to increase my wait timeouts, in case PHPUnit is introducing any strange delays. I'll also see if I can capture a failure case so that I can examine the logs.
† Ah, sadly it won't simplify things much. I just tried this on a simple signal integration test I regard as reliable, and since the backgrounded `system()` call returns immediately, the test still has to loop-wait to identify the right log record, and then for the right post-kill result. However, it no longer has to wait for a PID to be written to a temp file, so that is at least one loop eliminated.
As I mentioned in the question, the first reliability change I tried was to inject the ability for worker tasks to run `kill` on themselves. In my case this was built into the system, but readers may find that writing a child test class and changing their DI config would be a convenient way to do it (see the sketch below).
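As an illustration of that DI-override idea, something like the following subclass could be swapped in by the test configuration; the `Worker` base class and the `afterTaskStart()` hook are assumed names, not part of any real API:

```php
<?php
// Hypothetical test-only subclass: the worker interrupts itself at a
// known point, so the test need not discover the PID or time the kill.
// Worker and afterTaskStart() are assumed names for illustration.
class SelfKillingWorker extends Worker
{
    protected function afterTaskStart()
    {
        parent::afterTaskStart();

        // Send SIGTERM to our own process (requires ext-posix).
        posix_kill(posix_getpid(), SIGTERM);
    }
}
```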
This seems to have improved reliability a good deal. Originally, there were several wait loops in the tests, and the test would have to run the `kill` at the right moment:

1. Wait for the worker to write its PID to a temp file
2. Wait for the database to enter the correct state, up to some maximum wait time
3. Issue the `kill <pid>`
4. Wait for the database to show the expected post-kill result

The issue may have been in step 2: if the maximum wait time is too short then the kill may sometimes arrive too late, and even if a reliable maximum wait time is found, the test may still be prone to failure if the CPU is under unexpected load.
I have now written a quick script to repeatedly run the PHPUnit tests, either for 200 iterations or to the first failure, whichever comes first. This now passes 200 iterations, so for the time being I'll regard the test reliability as having gone up. However, I will update here if this changes; perhaps running the tests under a high `nice` value will trigger a failure.
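For reference, the repeat-runner can be as simple as the following sketch; the phpunit path and the test filter are assumptions:

```php
<?php
// Run the suite repeatedly, stopping at the first failure or after
// 200 clean iterations. The phpunit binary path is an assumption.
for ($i = 1; $i <= 200; $i++) {
    passthru('vendor/bin/phpunit --filter SignalHandlingTest', $exitCode);
    if ($exitCode !== 0) {
        echo "Failed on iteration {$i}\n";
        exit(1);
    }
}
echo "Passed all 200 iterations\n";
```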
Other answers are still most welcome.