Search code examples
algorithmlanguage-agnosticartificial-intelligencepattern-matchingdisaster-recovery

chain of events analysis and reasoning


My boss said logs in current state are not acceptable for the customer. If there is a fault, a dozen of different modules of the device report their own errors and they all land in logs. The original reason of the fault may be buried somewhere in the middle of the list, may not appear on the list (given module being too damaged to report), or appear way late after everything else finished reporting problems that result from the original fault. Anyway, there are few people outside the system developers who can properly interprete the logs and come up with what actually happened.

My current task is writing a module that does a customer-friendly fault-reporting. That is, gather all the events that were reported over the last ~3 seconds (which is about the max interval between origin of the fault occurring and the last resulting after-effects), do some magic processing of this data, and come up with one clear, friendly line what is broken and needs to be fixed.

The problem is the magic part: how, given a number of fault reports, to come up with the original source of the fault. There is no simple list of cause-effect list. There are just commonly occurring chains of events displaying certain regularities.

Examples:

  • short circuit detected, resulting in limited operation mode, the limited operation does not remove the fault, so emergency state is escalated, total output power disconnected.
  • safety line got engaged. No module reported engaging it within 3s since it was engaged, so an "unknown-source or interference" is attributed as the reason of system halt.
  • most output modules report no output voltage. About 1s later the power supply monitoring module reports power is out, which is the original reason.
  • an output module reports no output voltage in all of its output lines. No report from power supply module. The reason is a power line disconnected from the module.
  • an output module reports no output voltage in one of its output lines. No other faults reported. The reason is a burnt fuse.
  • an output module did not report back with applying received state. Shortly after, control module reports illegal state or output lines, (resulting from the output module really not updating the state in a timely manner.) The cause is the output module (which introduced the fault), not the control module (which halted the system due to fault detected).
  • a fault of input module switches the device to backup-failsafe mode. An output module not used so far, which was faulty gets engaged in this mode and the fault mode gets escalated to critical. The original reason is not the input, which is allowed to report false-positives concerning faults, but the broken backup output which aborted the operation.
  • there is no activity of any kind from an output module, for the last 2 seconds. This means it's broken and a fault mode must be entered.

There is no comprehensive list of rules as to what causes what. The rules will be added as new kinds of faults occur "in the wild" and are diagnosed and fixed. Some of them are heuristics - if this error is accompanied with these errors, then the fault is most likely this. Some faults will not be solved - a bland list of module reports will have to suffice. Some answers will be ambigous, one set of symptoms may suggest two different faults. This is more of a "best effort" than a "guaranteed solution" one.

Now for the (overly general and vague) question: how to solve this? Are there specific algorithms, methods or generalized solutions to this kind of problem? How to write the generalized rulesets and match against them? How to do the soft-matching? (say, an input module broke right in the middle of an emergency halt, it's a completely unrelated event to be ignored.) Help please?


Solution

  • In all honesty, I would just write a series of simple rules and be done with it. It will be a pain maintenance wise, but getting this right may be time consuming and brittle.

    If you insist, I would approach this by having each error drop some sort of symbol/token for each error code - you'll make this much harder if you try to do some bag of words/keyword matching. You would then input the outputted tokens in some sort of classifier.

    At heart, you need some sort of rules engine - be it fuzzy or exact. The first thing that comes to mind is a hand-built Bayesian network. This would allow for fuzzy matching as you would calculate the most probable 'report' as a function of the tokens you receive. It also allows you to set a threshold for token groups that aren't really indicative of anything by specifying the minimum probability to return an answer.

    You could also train a Bayes net or other type classifier, but you'll need quite a bit of data that you've manually labeled (token1,token2,token3->faultxyz) and it might be more accurate to do it yourself.