linux, process, posix, identifier

How to identify if a long-running process died?


I'm working on a daemon that communicates with several processes. The daemon can't monitor the processes all the time, but it must be able to reliably detect when a process dies so that it can release the scarce resources it holds for that process.

The processes can communicate with the daemon, giving it some information at the start, but not vice versa. So the daemon can't just ask a process its identity.

The simplest form would be to use just their PID. But eventually another process could be assigned the same PID without my tool noticing.

A better approach would be to use the PID plus the time the process started. A new process with the same PID would have a distinct start time. But I couldn't find a way to get the process start time in a POSIX way. Using ps or looking at /proc/<pid>/stat does not seem portable enough.

A more complicated idea that seems POSIX-compliant would be:

  • Each process creates a temporary file.
  • It locks the file using flock.
  • It tells my daemon "my identity is connected with this file".
  • At any time, the daemon can check the temporary file. If it's locked, the process is alive. If it's not, the process is dead.

But this seems unnecessarily complicated.

Is there a better, or standard way?

Edit: The daemon must be able to resume after a restart, so it's not possible to keep a persistent connection for each process.


Solution

  • But I couldn't find a way how to get the process start time in a POSIX way.

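    Try the standard "etime" format specifier, which prints the time elapsed since each process started: LC_ALL=C ps -o etime= -p "$PIDS". Subtracting that value from the current time gives the start time.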
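
A minimal sketch of how a daemon might turn etime into a (PID, start time) identity. The function names (parse_etime, approx_start_time, same_process) and the two-second tolerance are assumptions made for illustration, not part of any standard API. The parser relies on the POSIX-specified etime layout, [[dd-]hh:]mm:ss.

```python
import os
import re
import subprocess
import time


def parse_etime(text):
    """Convert a ps 'etime' value ([[dd-]hh:]mm:ss) to seconds."""
    match = re.fullmatch(r"(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unexpected etime value: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def approx_start_time(pid):
    """Return the approximate start time (epoch seconds), or None if ps reports no such PID."""
    result = subprocess.run(
        ["ps", "-o", "etime=", "-p", str(pid)],
        capture_output=True,
        text=True,
        env={**os.environ, "LC_ALL": "C"},
    )
    output = result.stdout.strip()
    if result.returncode != 0 or not output:
        return None
    return time.time() - parse_etime(output)


def same_process(pid, recorded_start, tolerance=2.0):
    """Hypothetical identity check: same PID and roughly the same start time."""
    current = approx_start_time(pid)
    return current is not None and abs(current - recorded_start) <= tolerance
```

Because etime has one-second resolution, the derived start time jitters slightly between calls, which is why the comparison uses a tolerance rather than equality. A PID that is recycled within that window would still be misidentified.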

    In fairness, I would probably construct my own table of live processes rather than relying on the process table and elapsed time. That's fundamentally your file-locking approach, though I'd probably aggregate all the lockfiles together in a known place and name them by PID, e.g., /var/run/my-app/8819.lock. Indeed, this might even be retrofitted onto the long-running processes, since file locks on file descriptors can be inherited across exec().
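
The lockfile table could look roughly like the sketch below. It assumes flock-style locks (available on Linux and the BSDs, though not specified by POSIX), and the helper names claim_identity and is_alive are hypothetical. The directory is passed in as a parameter; /var/run/my-app is just the example location from above.

```python
import fcntl
import os


def claim_identity(lock_dir, pid=None):
    """Worker side: create <lock_dir>/<pid>.lock and hold an exclusive lock on it.

    The returned file object must stay open for the lifetime of the process;
    the kernel drops the lock when the process exits, however it dies.
    """
    pid = os.getpid() if pid is None else pid
    handle = open(os.path.join(lock_dir, f"{pid}.lock"), "w")
    fcntl.flock(handle, fcntl.LOCK_EX)  # blocks briefly if the daemon is probing
    return handle


def is_alive(lock_dir, pid):
    """Daemon side: a held lock means the owner is alive."""
    path = os.path.join(lock_dir, f"{pid}.lock")
    try:
        probe = open(path, "r")
    except FileNotFoundError:
        return False  # never registered, or already cleaned up
    with probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True  # somebody still holds the lock
        # The lock was free, so the owner is gone; closing the file releases it.
        return False
```

The daemon keeps no state of its own here, so after a restart it can rebuild its view by scanning the directory. Once is_alive returns False, the daemon would typically release the resources and remove the stale lockfile so that a later process with the same PID starts from a clean slate.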

    (Of course, if the long-running processes I cared about had a common parent, then I'd rather query the common parent, who can be a reliable authority on which processes are running and which are not.)
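
For the common-parent case, a sketch under the assumption that the parent launches the workers itself; the Supervisor class and its method names are made up for illustration. The parent is authoritative because a child's PID cannot be reassigned until the parent has reaped it.

```python
import subprocess


class Supervisor:
    """Hypothetical common parent that tracks the children it started."""

    def __init__(self):
        self._children = {}

    def spawn(self, argv):
        """Start a child process and remember it by PID."""
        child = subprocess.Popen(argv)
        self._children[child.pid] = child
        return child.pid

    def is_running(self, pid):
        """Return True while the child is running; forget it once it has exited."""
        child = self._children.get(pid)
        if child is None:
            return False
        if child.poll() is None:  # poll() reaps the child if it has exited
            return True
        del self._children[pid]
        return False
```

The daemon would then ask this parent about a PID (over whatever channel is available) instead of inspecting the process table directly.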