My boss said the logs in their current state are not acceptable for the customer. If there is a fault, a dozen different modules of the device report their own errors, and they all land in the logs. The original cause of the fault may be buried somewhere in the middle of the list, may not appear on the list at all (the module in question being too damaged to report), or may appear very late, after everything else has finished reporting problems that result from the original fault. In any case, few people other than the system developers can properly interpret the logs and work out what actually happened.
My current task is writing a module that does customer-friendly fault reporting. That is, gather all the events that were reported over the last ~3 seconds (which is about the maximum interval between the original fault occurring and the last resulting after-effects), do some magic processing of this data, and come up with one clear, friendly line saying what is broken and needs to be fixed.
The problem is the magic part: how, given a number of fault reports, to come up with the original source of the fault. There is no simple cause-effect list. There are just commonly occurring chains of events displaying certain regularities.
Examples:
There is no comprehensive list of rules as to what causes what. The rules will be added as new kinds of faults occur "in the wild" and are diagnosed and fixed. Some of them are heuristics - if this error is accompanied by these errors, then the fault is most likely this. Some faults will not be solved - a bland list of module reports will have to suffice. Some answers will be ambiguous: one set of symptoms may suggest two different faults. This is more of a "best effort" solution than a "guaranteed" one.
Now for the (overly general and vague) question: how to solve this? Are there specific algorithms, methods or generalized solutions to this kind of problem? How to write the generalized rulesets and match against them? How to do the soft matching? (Say an input module broke right in the middle of an emergency halt: it's a completely unrelated event that should be ignored.) Help please?
In all honesty, I would just write a series of simple rules and be done with it. It will be a pain maintenance-wise, but anything more sophisticated may be time-consuming to get right and brittle once built.
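To give an idea of what "a series of simple rules" could look like, here is a minimal sketch. The token names, the diagnoses and the `diagnose` helper are all made up for illustration; real rules would be added as faults are diagnosed in the field. A rule fires when all of its required tokens are present, so unrelated extra reports are simply ignored, and the most specific matching rule wins.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Rule(NamedTuple):
    required: FrozenSet[str]  # tokens that must all be present for the rule to fire
    diagnosis: str            # the customer-facing line


# Hypothetical rules, for illustration only.
RULES: List[Rule] = [
    Rule(frozenset({"PSU_UNDERVOLT", "MOTOR_STALL", "SENSOR_TIMEOUT"}),
         "Power supply failure"),
    Rule(frozenset({"MOTOR_STALL", "SENSOR_TIMEOUT"}),
         "Motor jammed"),
]


def diagnose(tokens: Set[str]) -> List[str]:
    """Return the diagnoses of the most specific matching rules.

    An empty list means no rule matched; more than one entry means
    the symptoms are ambiguous.
    """
    matches = [r for r in RULES if r.required <= tokens]  # subset test
    if not matches:
        return []
    most_specific = max(len(r.required) for r in matches)
    return [r.diagnosis for r in matches if len(r.required) == most_specific]
```

If `diagnose` returns an empty list, you fall back to the plain list of module reports; if it returns several entries, you show all of them as candidates.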
If you insist, I would approach this by having each module drop some sort of symbol/token for each error code it reports - you'll make this much harder if you try to do some bag-of-words/keyword matching on the message text. You would then feed the emitted tokens into some sort of classifier.
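As a rough sketch of the token idea, assuming each report carries a timestamp, a module name and a numeric error code (the `Report` type and its field names are my assumption, not something from your system), you could collect the tokens seen in the ~3 second window like this:

```python
from typing import List, NamedTuple, Set


class Report(NamedTuple):
    timestamp: float  # seconds; assumed to be comparable across modules
    module: str       # hypothetical module name
    code: int         # hypothetical numeric error code


WINDOW_SECONDS = 3.0  # the ~3 s interval mentioned in the question


def to_token(report: Report) -> str:
    """One stable token per (module, error code); no message text involved."""
    return f"{report.module}:{report.code}"


def tokens_in_window(reports: List[Report], first_fault_time: float) -> Set[str]:
    """Tokens of all reports within WINDOW_SECONDS of the first fault."""
    window_end = first_fault_time + WINDOW_SECONDS
    return {
        to_token(r)
        for r in reports
        if first_fault_time <= r.timestamp <= window_end
    }
```

Using a set throws away ordering and repetition; if the order of reports turns out to matter for some faults, you would keep a sorted list instead.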
At heart, you need some sort of rules engine - be it fuzzy or exact. The first thing that comes to mind is a hand-built Bayesian network. This would allow for fuzzy matching as you would calculate the most probable 'report' as a function of the tokens you receive. It also allows you to set a threshold for token groups that aren't really indicative of anything by specifying the minimum probability to return an answer.
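A minimal sketch of that idea follows. It uses the simplest possible network shape, a naive Bayes model in which each token depends only on the fault; that is my simplification, and a real hand-built network could have more structure. All fault names, token names and probabilities are invented for illustration. Tokens that are not in the tables are ignored, which is one crude way of handling unrelated events, and an explicit "unknown" class absorbs token groups that aren't indicative of anything.

```python
from typing import Dict, Optional, Set

# Hand-assigned prior probability of each fault (hypothetical numbers).
PRIORS: Dict[str, float] = {"psu_failure": 0.3, "motor_jam": 0.3, "unknown": 0.4}

# Hand-assigned P(token is reported | fault) (hypothetical numbers).
LIKELIHOODS: Dict[str, Dict[str, float]] = {
    "psu_failure": {"PSU_UNDERVOLT": 0.9, "MOTOR_STALL": 0.7, "SENSOR_TIMEOUT": 0.6},
    "motor_jam": {"PSU_UNDERVOLT": 0.05, "MOTOR_STALL": 0.9, "SENSOR_TIMEOUT": 0.5},
    "unknown": {"PSU_UNDERVOLT": 0.1, "MOTOR_STALL": 0.1, "SENSOR_TIMEOUT": 0.1},
}


def posterior(tokens: Set[str]) -> Dict[str, float]:
    """P(fault | tokens), using both the presence and absence of known tokens."""
    scores: Dict[str, float] = {}
    for fault, prior in PRIORS.items():
        p = prior
        for token, p_seen in LIKELIHOODS[fault].items():
            p *= p_seen if token in tokens else (1.0 - p_seen)
        scores[fault] = p
    total = sum(scores.values())
    return {fault: score / total for fault, score in scores.items()}


def most_probable(tokens: Set[str], threshold: float = 0.8) -> Optional[str]:
    """Best fault, or None if nothing clears the minimum probability."""
    post = posterior(tokens)
    fault = max(post, key=lambda f: post[f])
    if fault == "unknown" or post[fault] < threshold:
        return None
    return fault
```

When `most_probable` returns `None`, you would fall back to the raw list of module reports. For ambiguous cases you could instead return every fault whose posterior exceeds some lower threshold.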
You could also train a Bayes net or another type of classifier, but you'll need quite a bit of data that you've manually labeled (token1,token2,token3->faultxyz), and it might be more accurate to build the network by hand.
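For completeness, here is a rough sketch of what "training" could mean in the simplest case: estimating naive Bayes priors and per-token probabilities by counting over labeled examples, with add-one (Laplace) smoothing so that unseen combinations don't get a probability of zero. The `train` function and the example format are my assumptions; a real project would more likely use an existing machine-learning library.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Example = Tuple[Set[str], str]  # (tokens seen, manually assigned fault label)


def train(
    examples: List[Example], alpha: float = 1.0
) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Estimate P(fault) and P(token present | fault) from labeled examples."""
    fault_counts: Counter = Counter(fault for _, fault in examples)
    token_counts: Dict[str, Counter] = defaultdict(Counter)
    vocabulary: Set[str] = set()
    for tokens, fault in examples:
        vocabulary |= tokens
        token_counts[fault].update(tokens)  # each token counted once per example

    total = len(examples)
    priors = {fault: count / total for fault, count in fault_counts.items()}
    likelihoods = {
        fault: {
            # Smoothed fraction of this fault's examples that contained the token.
            token: (token_counts[fault][token] + alpha)
            / (fault_counts[fault] + 2 * alpha)
            for token in vocabulary
        }
        for fault in fault_counts
    }
    return priors, likelihoods
```

With only a handful of labeled incidents per fault, the smoothing term dominates the estimates, which is the practical reason hand-set probabilities may well be more accurate.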