Search code examples
.netloggingetwetw-eventsource

Risk of missing events from ETW logging with EventSource


I'm instrumenting my .NET 4.5 applications to emit ETW events using the EventSource class. The goal is to be able to capture some of these events (the Error level events) for error logging.

After doing some reading and testing, I am concerned about the reliability of this approach to error logging, specifically regarding the possibility of dropped or missing events. If my error logging isn't working I need the application to shut down (in my case it is unsafe for it to run with unreported errors). When using ETW and EventSource, how can I be certain that my errors are getting properly recorded?

Obviously part of the answer will depend on what is listening to the events. In my case I plan to use the "Semantic Logging Application Block" from the latest MS Enterprise Library.

Here is one source where Microsoft talks about the possible causes of missed events: About Event Tracing

There they list these possible causes of Missing Events

  • The total event size is greater than 64K. This includes the ETW header plus the data or payload. A user has no control over these missing events since the event size isconfigured by the application.

  • The ETW buffer size is smaller than the total event size. A user has no control over these missing events since the event size is configured by the application logging the events.

  • For real-time logging, the real-time consumer is not consuming events fast enough or is not present altogether and then the backing file is filling up. This can result if the Event Log service is stopped and started when events are being logged. A user has no control over these missing events.

  • When logging to a file, the disk is too slow to keep up with the logging rate.

To see if these concerns were somehow mitigated using the EventSource class (like, does it truncate large payloads somehow) I did some testing. I tried to log long strings and it failed for me between 30,000 and 35,000 characters (right in line with the 64KB maximum event payload). It just silently does nothing from what I can tell for the too-large strings, no events at all in my Semantic Logging Application Block log. Events before and after were written like usual.

So anytime I have a string in my payload I have to pass it through some truncator? Will I need to manually avoid generating events "too fast" as well (and how would that be possible)?

Microsoft Patterns and Practices are supposed to lead us to good ... patterns and practices ... so maybe I'm just missing something here.

Update:

Well apparently there is some notice in the consuming application for the "Events too fast" condition. I received this today for the first time:

Level : Warning, Message : Some events will be lost because of buffer overruns or schema synchronization delays in trace session: Microsoft-SemanticLogging-Etw-svcRuntime

And then when closing the session:

Level : Warning, Message : The loss of 1 events was detected in trace session 'Microsoft-SemanticLogging-Etw-svcRuntime'.

Update2:

The Enterprise Library Developers Guide describes the behavior I just mentioned.

You should monitor the log messages generated by the Semantic Logging Application Block for any indication that the have buffers overflowed and that you have lost messages. For example, log messages with event ids 900 and 901 indicate that a sink’s internal buffers have overflowed; in the out-of-process scenario, event ids 806 and 807 indicate that the ETW buffers have overflowed. You can modify the buffering configuration options for the sinks to reduce the chance that the buffers overflow with your typical workloads.

My question remains, can I use semantic logging while ensuring my application does not run if errors are being dropped? Normal tracing events could be dropped ...

My current thought is to log "critical" errors with a separate class using old-fashioned logging techniques, and keep less critical errors (as well as debug type events) going through the ETW pipeline. That wouldn't be too bad really ... I might post that as a solution if I can't find a better suggestion.

Update 3:

The "missing events" warning that I received had nothing to do with buffer overruns, turns out this is the message you get if you pass a null string as a payload value.


Solution

  • The EventSource class comes in two versions, one included with the .NET Framework and another in the NuGet package Microsoft EventSource Library. I assume that you use the NuGet package because it contains newer code.

    The constructor for the EventSource base class has an overload that takes a boolean argument throwOnEventWriteErrors with the following documentation (NuGet package version 1.0.26.0):

    By default calling the 'WriteEvent' methods do NOT throw on errors (they silently discard the event). This is because in most cases users assume logging is not 'precious' and do NOT wish to have logging failures crash the program. However for those applications where logging is 'precious' and if it fails the caller wishes to react, setting 'throwOnEventWriteErrors' will cause an exception to be thrown if WriteEvent fails. Note the fact that EventWrite succeeds does not necessarily mean that the event reached its destination only that operation of writing it did not fail.

    Unfortunately, the last sentence contains a caveat emptor but if you look into the source code for the EventSource you can see that the underlying return codes from the OS calls are used to throw different exceptions for NoFreeBuffers and EventTooBig (and other errors).

    So if you turn on throwOnEventWriteErrors you will get exceptions if the EventSource class is unable to deliver the event to ETW. However, if ETW fails for another reason you wont get any exception but if you ensure that your ETW channels are configured correctly that should rarely if ever happen. However, as you cannot tolerate losing any error events you should probably test extreme error cases to make sure that ETW behaves as you expect.