How to analyze logs in a distributed system?

When unexpected behavior occurs in a distributed system (such as a set of Raft nodes), the logical path of a request or data flow can usually only be reconstructed from logs. However, because the logs are spread across many nodes, this is difficult. I found that there are tools like ShiViz that can visualize requests or data flow from logs, but they require modifying the source code. Are there any other similar tools, preferably less invasive ones?

Solution

There are two major approaches. One is to have a tool that can go to every server and search its logs. The other is to have a central location for logs, with all nodes pushing their logs to that storage - this is how AWS CloudWatch works.

In either case, from an operator's point of view, there is a single tool with which they can search all logs.

The second part of your question is how to make this analysis effective.

First of all, logs should be of good quality. This sounds like a naive thing to say, but it is very important. I can't count how many times I have had to analyze detailed but useless logs.

The second challenge is how to analyze processes that span several nodes. This is more complicated. There are two main problems here:

How to find all logs related to the same "event" - e.g. let's say an API call results in 5 services being called - how can we trace this call across these services? The typical solution is to generate a unique request id in the first service and then propagate this id through all the other services.
How to reassemble the order of calls across nodes. From a "theoretical" point of view, this problem is about Total Order - we need to be able to take any two log events and say which one happened first. We can't rely on wall-clock timestamps here, because clocks on different nodes are not synchronized accurately enough. Luckily for us, there is a well-known and simple algorithm to handle this: Lamport timestamps. Of course, the developer has to add it to the code to make it work. It could go either in the service code or in the log agent code (the log agent is the tool which aggregates all logs). It is worth mentioning that total order may be overkill if your distributed system has a tree-like call structure, e.g. service A always receives requests from users and then calls services B and C - in this case carrying over the request id is enough, as you know the order already. Total order is needed in cases like Raft, where it is not always clear who calls whom.
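
The two ideas above (a propagated request id plus Lamport timestamps) can be sketched together as follows. This is only a minimal illustration, not how any particular log agent does it: the names Node, LogEntry and merged_trace are hypothetical, messages are plain dictionaries instead of real network calls, and ties between equal counters are broken by node name, which gives a deterministic order but says nothing about which of two concurrent events "really" happened first.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    lamport: int      # logical (Lamport) timestamp
    node: str         # node that wrote the entry
    request_id: str   # correlates all entries that belong to one request
    message: str


@dataclass
class Node:
    name: str
    clock: int = 0
    log: list = field(default_factory=list)

    def _record(self, request_id, message):
        self.log.append(LogEntry(self.clock, self.name, request_id, message))

    def local_event(self, request_id, message):
        # Rule 1: increment the counter before every local event.
        self.clock += 1
        self._record(request_id, message)

    def send(self, request_id, body):
        # Rule 2: sending is an event; the counter and the request id
        # travel with the message.
        self.clock += 1
        self._record(request_id, f"send: {body}")
        return {"request_id": request_id, "lamport": self.clock, "body": body}

    def receive(self, packet):
        # Rule 3: on receive, move past the sender's counter.
        self.clock = max(self.clock, packet["lamport"]) + 1
        self._record(packet["request_id"], f"recv: {packet['body']}")


def merged_trace(nodes, request_id):
    """Collect one request's entries from all nodes, ordered by Lamport time."""
    entries = [e for n in nodes for e in n.log if e.request_id == request_id]
    # Node name is an arbitrary but deterministic tie-breaker.
    return sorted(entries, key=lambda e: (e.lamport, e.node))


a, b = Node("A"), Node("B")
rid = str(uuid.uuid4())                       # generated once, on the first node

a.local_event(rid, "request accepted")
packet = a.send(rid, "AppendEntries")
b.local_event("other-request", "unrelated work")
b.receive(packet)
reply = b.send(rid, "ack")
a.receive(reply)

trace = merged_trace([a, b], rid)
for entry in trace:
    print(entry.lamport, entry.node, entry.message)
```

Filtering by request id answers "which log lines belong together", and sorting by the logical counter answers "in what order did they happen", without trusting any node's wall clock. In a tree-like call structure the counter could be dropped and the request id alone would be enough.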