logging architecture software-design

What are ways to solve the logging haystack problem in larger systems?


I have always thought that the more logging you have, the better for debugging and troubleshooting. It makes things so much easier, provided you are logging the right things and not logging noobish things. I have always done this on a smaller scale. How would you do this in larger systems? Would it be worth it, or would it be more beneficial to log only what is needed to identify that there is a problem and what the problem is, without very specific details?

The logging haystack problem being: so many log lines/rows that it's hard to track a full request or multiple requests. There is no separation between requests, just one line/row after another.

On the small scale I have done the following successfully, and I do like the approach. See below for a quick high-level design. Yes, I am a .NET developer :).

  • Definitely a custom logger with a global log context.
  • A log file per request with the file name as RequestName_Timestamp_GUID.
  • The folder has an _exceptions file, in which every request that hits an exception gets one log line for identification.
  • It has a log roller and a file watcher as well.
  • For any request with an error, the log file name begins with !~~, i.e. !~~RequestName_Timestamp_GUID.
  • This makes the error requests easy to find, as Windows Explorer puts these at the top.
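
A rough sketch of the per-request file idea above, written in Python rather than .NET and not the actual implementation. The `RequestLog` helper, the `.log` extension, and the timestamp format are made up for illustration, and the log roller and file watcher are left out.

```python
import logging
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

# Prefix from the list above; per the post, Windows Explorer lists these files first.
ERROR_PREFIX = "!~~"


class RequestLog:
    """Context manager that writes one log file per request (hypothetical helper)."""

    def __init__(self, log_dir: Path, request_name: str):
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
        # RequestName_Timestamp_GUID, as described above.
        self.name = f"{request_name}_{stamp}_{uuid.uuid4()}"
        self.log_dir = log_dir
        self.path = log_dir / f"{self.name}.log"
        self.logger = logging.getLogger(self.name)
        self.logger.setLevel(logging.DEBUG)
        self.logger.propagate = False  # keep request lines out of any shared log
        self._handler = logging.FileHandler(self.path, encoding="utf-8")
        self._handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        self.logger.addHandler(self._handler)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.logger.error("request failed", exc_info=(exc_type, exc, tb))
        # Close the file before renaming it (required on Windows).
        self.logger.removeHandler(self._handler)
        self._handler.close()
        if exc_type is not None:
            # One identifying line per failed request in the shared _exceptions file.
            with open(self.log_dir / "_exceptions.log", "a", encoding="utf-8") as f:
                f.write(f"{self.name}: {exc_type.__name__}: {exc}\n")
            failed_path = self.log_dir / f"{ERROR_PREFIX}{self.name}.log"
            self.path.rename(failed_path)
            self.path = failed_path
        return False  # never swallow the exception


if __name__ == "__main__":
    log_dir = Path(tempfile.mkdtemp())
    with RequestLog(log_dir, "GetOrder") as req:
        req.logger.info("loading order")
    try:
        with RequestLog(log_dir, "SaveOrder") as req:
            req.logger.info("saving order")
            raise ValueError("order is empty")
    except ValueError:
        pass
    print(sorted(p.name for p in log_dir.iterdir()))
```

Doing the rename only after the handler is closed keeps the file name stable while the request is still running, so the prefix marks finished, failed requests only.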

FYI: Not exactly sure this is the best place for this question.


Solution

  • The challenge of how to debug a large-scale system is pretty broad, so it's better to break it down into several separate pieces.

    Collect logs from all the systems

    One common way is to use an existing system for centralized log collection and management. You can use a commercial solution like Splunk or an open source stack like ELK (Elasticsearch, Logstash, Kibana). These systems usually have three main parts:

    1. Log collection - listens to the log outputs and forwards them to the centralized location. This can be done in several different ways, depending on the needs and the technology used.
    2. Centralized storage - some DBMS that can store large amounts of data and index it so it can be searched efficiently.
    3. GUI - lets users search the DBMS and set up dashboards and alerts.

    https://www.elastic.co/what-is/elk-stack
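
On the application side, one thing that tends to make the collection and indexing steps easier is writing each log entry as a single machine-parseable line. The following is only a minimal sketch under that assumption, using Python's standard `logging` module. The field names and the `build_logger` helper are invented for illustration and are not something Splunk or ELK require.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (field names are an assumption)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream) -> logging.Logger:
    """Hypothetical helper: attach a JSON-formatting handler to a named logger."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("orders", sys.stdout)
    log.info("order %s saved", 42)
```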

    Track the requests across multiple systems

    The best approach for most setups is to generate a unique correlationId, which is then carried across all the systems and included in every log message.

    https://dzone.com/articles/correlation-id-for-logging-in-microservices
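
A minimal sketch of the correlation ID idea, assuming Python's standard `logging` and `contextvars` modules. The `X-Correlation-ID` header name, `handle_request`, and `build_correlated_logger` are hypothetical. The point is only that the ID is read from the incoming request (or created if missing), stamped on every log line, and handed on to the next system.

```python
import contextvars
import logging
import sys
import uuid

# Holds the ID for whichever request the current thread or task is serving.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

HEADER = "X-Correlation-ID"  # assumed name; use whatever your services agree on


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_correlated_logger(stream) -> logging.Logger:
    """Hypothetical helper: a logger whose lines all carry the correlation ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger("correlation_demo")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def handle_request(headers: dict, logger: logging.Logger) -> dict:
    """Hypothetical entry point: reuse the caller's ID or create a new one."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        # Headers to send along on every outgoing call to another system.
        return {HEADER: cid}
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    log = build_correlated_logger(sys.stdout)
    outgoing = handle_request({}, log)          # no incoming ID: a new one is created
    handle_request(outgoing, log)               # downstream call reuses the same ID
```

With the ID in every line, searching the centralized storage for one value returns the whole request across all systems.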

    Logger configuration

    For debugging, the most important thing is to actually generate the log messages. This topic is fairly complex on its own: too much logging makes the code harder to read and reduces performance, while incomprehensible log messages make it hard to understand the problem.
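
One common way to balance these concerns is to use log levels, and to build expensive detail only when the level that needs it is enabled. A small sketch with Python's standard `logging` module follows. The `process` and `expensive_dump` functions are stand-ins invented for this example.

```python
import logging


def expensive_dump(order: dict) -> str:
    """Stand-in for a costly serialization (hypothetical helper)."""
    expensive_dump.calls += 1
    return ", ".join(f"{key}={value}" for key, value in sorted(order.items()))


expensive_dump.calls = 0  # counts how often the costly work really happened


def process(order: dict, logger: logging.Logger) -> None:
    # Cheap and always useful: what happened, and to which entity.
    logger.info("processing order id=%s", order["id"])
    # Costly detail: only built when DEBUG is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("order contents: %s", expensive_dump(order))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    process({"id": 7, "total": 10}, logging.getLogger("orders"))
    print("expensive_dump calls:", expensive_dump.calls)
```

Passing arguments separately (`"...%s", value`) instead of pre-formatting the string also means the message is only formatted when a handler actually emits it.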