language-agnostic, process-monitoring

Best practice to monitor program life


I'd like to hear your opinions on monitoring whether a program is still alive.

Here is the scenario: you have a simple program that normally works, meaning it is well written, exceptions are handled, and so on.

How would you proceed if you wanted to ensure that this program keeps working FOREVER?

No external tools like crontab are available, but any overhead can be added.

Would you use another program that continuously "pings" the main program? Or have the main program touch a file and let another program check the file's modification time?

And how do you ensure that this second program always works?

So, come on, tell me your opinions or best practices in this context!

As a footnote, I have to write this program in Python, but it's a general-purpose question!


Solution

  • In embedded systems, the usual approach is a watchdog module.

    A watchdog checks some location (could be a file, could be a memory location, whatever), and restarts the system under examination if the location does not meet certain criteria.

    So you might have the program being monitored periodically write an epoch timestamp to some programname_watchdog file. This would be part of its regular loop.

    Then your watchdog (in a totally different process) would check the file. If the timestamp is sufficiently outdated, the other program is killed and restarted, since it is deemed to have critically malfunctioned (either hung or crashed). Note that your watchdog contains only simple logic, so its chances of failing are much lower.

    I'm positive there are other ways to accomplish this as well. This is just one way.

    edit: You have to consider the stack your system is built on. The more external dependencies you have, the more risk of failure. You also have to consider a formal proof of program correctness if you are looking for perfect operation.

    The question really becomes what you are expecting from your system; what sort of failures are unacceptable and what sort of failures are expected so you can compensate for them.

    This question very quickly becomes a proof/hardware/software co-design issue (and an expensive one, too). I'm curious to see what you are doing and what your solution is.
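
Since the question mentions Python, here is a minimal sketch of the heartbeat-file watchdog described above. It is only an illustration under stated assumptions: the function names, the 30-second staleness threshold, the 5-second polling interval and the main_program.py script name are all made up for the example, and the restart policy (kill, then relaunch) is just one reasonable choice.

```python
import os
import subprocess
import sys
import time


def write_heartbeat(path, now=None):
    """Called by the monitored program inside its regular loop.

    Writes the current epoch time to `path`. The `now` parameter is a
    hypothetical hook that makes the function easy to test.
    """
    stamp = time.time() if now is None else now
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(stamp))
    # Rename into place so the watchdog never reads a half-written file.
    os.replace(tmp, path)


def heartbeat_age(path, now=None):
    """Return seconds since the last heartbeat, or None if unreadable."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            return now - float(f.read().strip())
    except (OSError, ValueError):
        return None


def is_stale(path, max_age, now=None):
    """A missing, corrupt or too-old heartbeat counts as a failure."""
    age = heartbeat_age(path, now)
    return age is None or age > max_age


def watchdog(cmd, path, max_age=30.0, poll=5.0):
    """Run `cmd` forever, restarting it when it exits or stops reporting.

    max_age and poll are assumed values; tune them to the real loop period.
    """
    while True:
        # Fresh stamp so a newly started process gets a grace period.
        write_heartbeat(path)
        proc = subprocess.Popen(cmd)
        while proc.poll() is None and not is_stale(path, max_age):
            time.sleep(poll)
        if proc.poll() is None:
            proc.kill()  # still running but silent: treat it as hung
        proc.wait()


# Example launch (hypothetical script name):
# watchdog([sys.executable, "main_program.py"], "programname_watchdog")
```

The monitored program would call write_heartbeat("programname_watchdog") once per iteration of its main loop. The watchdog itself stays deliberately small, in line with the point above that simple logic is less likely to fail.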