I'm running some applications on EC2 spot instances. Such instances can be killed by Amazon with no notice.
During shutdown, processes are killed in some order. We have monitoring/recovery programs that should behave differently depending on whether the server is shutting down or the process just crashed (specifically, we don't want to do anything if the server is actually shutting down).
How can I detect in the recovery process (if it is still alive) that processes were killed because of a shutdown?
(More system details: I'm running unknown/untrusted code in a sandbox that doesn't modify external state. Generally, if sandboxed code crashes, it is the fault of the author of the untrusted code and we will not rerun it. But if the sandboxed code is terminated because the VM is shutting down or failing, we need to rerun it on another instance. The problem I'm having right now is that the user's code is terminated first, so the monitoring program incorrectly believes the crash is a user error.)
Run an agent on each machine that spawns the sandbox child processes. The agent runs your own code, which is "crash proof", and the sandbox runs user code, which could crash.
The monitoring system, which is in charge of starting a new machine with a new sandbox process, checks which processes have been killed: both the agent and the sandbox process, or only the sandbox child process.
It does that by opening a TCP connection (RMI/RPC/HTTP) to the agent and querying it about its child processes. If the agent responds, the machine is still running, and the agent can be asked about its child sandbox processes. If the agent does not respond, the machine is suspected of having been terminated.
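A rough sketch of the monitor side of that check, using only the Python standard library. The one-line "STATUS" request, the JSON reply, and the "sandbox_alive" field are assumptions made up for illustration; any RPC or HTTP call would do the same job.

```python
import json
import socket


def query_agent(host, port, timeout=2.0):
    """Return the agent's status dict, or None if the agent is unreachable.

    The wire format (a "STATUS" line answered by one line of JSON) is a
    hypothetical protocol, not something the agent is known to speak.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"STATUS\n")
            data = conn.makefile("rb").readline()
        return json.loads(data)
    except (OSError, ValueError):
        # Refused, timed out, or garbage reply: treat the agent as not responding.
        return None


def classify(status):
    """Map an agent reply to the three cases described above."""
    if status is None:
        return "machine-suspect"   # agent silent: the VM may be gone, rerun elsewhere
    if status.get("sandbox_alive"):
        return "healthy"
    return "sandbox-crashed"       # agent alive, child dead: likely a user-code crash
```

The point of the split is that a dead sandbox only counts as a user error when the agent is still around to report it.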
The agent is also in charge of restarting the child sandbox process on the same VM in case it crashes.
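The restart part of the agent could look roughly like this. The function name `supervise` and the `max_restarts` cap are my own choices for the sketch, not part of any existing tool.

```python
import subprocess
import sys


def supervise(cmd, max_restarts=3):
    """Run cmd, restarting it after each non-zero exit.

    Returns the list of exit codes observed. On POSIX, subprocess reports a
    negative code -N when the child was terminated by signal N.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.run(cmd).returncode
        exit_codes.append(code)
        if code == 0:
            break  # clean exit, nothing to restart
    return exit_codes


if __name__ == "__main__":
    # Example: a child that always fails is run once and restarted once.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(3)"], max_restarts=1))
```

The agent can keep the recorded exit codes and return them when the monitoring system asks about its children.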
Use a look-up service (such as ZooKeeper) to keep track of which processes have sent heartbeat keep-alives. If the agent is alive, the machine is still running; if the agent is not alive, the machine is not running.
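To show the heartbeat logic itself, here is a minimal in-memory stand-in. It is not ZooKeeper client code; the class name, the TTL value, and the injectable clock are assumptions for the sketch. With ZooKeeper you would more likely rely on ephemeral nodes, which disappear when the agent's session ends.

```python
import time


class HeartbeatRegistry:
    """Tracks the last heartbeat per agent (in-memory stand-in for a look-up service)."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_seen = {}

    def beat(self, agent_id):
        """Record a keep-alive from the given agent."""
        self.last_seen[agent_id] = self.clock()

    def is_alive(self, agent_id):
        """True if the agent has sent a heartbeat within the last ttl seconds."""
        seen = self.last_seen.get(agent_id)
        return seen is not None and self.clock() - seen <= self.ttl
```

An agent that has never reported, or whose last heartbeat is older than the TTL, is treated as not alive.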
Poll the EC2 APIs to determine whether the machine is in the running or terminated state.
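A sketch of that poll, assuming you use boto3. The helper names are mine; the function takes the client as a parameter, so the only boto3 call it relies on is `describe_instances`.

```python
def instance_state(ec2_client, instance_id):
    """Return the EC2 state name for instance_id, or None if it is not listed.

    ec2_client is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance.get("InstanceId") == instance_id:
                return instance["State"]["Name"]
    return None


def machine_is_going_away(state):
    """True for states meaning the VM is being, or has been, shut down."""
    return state in {"shutting-down", "terminated", "stopping", "stopped"}
```

If the sandbox died while the instance state is anything other than "running" or "pending", the monitor can treat it as a machine failure and rerun the job elsewhere.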