Search code examples
azureetwazure-diagnostics

ETW events in Azure diagnostics (SDK 2.5) are logged with incorrect / missing schema


I upgraded to Azure SDK 2.5 and switched to semantic logging with EventSources. Logging works locally with a custom EventListener.

When deployed, logs are written to a storage table, but only the EventId, Pid, Tid etc. are populated, the really interesting fields (Message, Task, Keyword, Opcode) are left blank.

The diagnostics infrastructure log is full of errors with regards to ETW, but I don't know what to make of them:

Failed to load backup EventSource manifest file C:\Resources\{13b7ec61-6424-d4d3-9972-a83e58d8d6bb}\directory\f71b19461fcf494d89d3717b3a13cadf.    something.WorkerRole.DiagnosticStore\WAD0103\Configuration\EventSource_Manifest_fe06b63d-39aa-5419-0529-18c4dacf4f68_Ver_20.backup.xml;  

EventSource    events will be logged without a proper schema until provider sends the manifest packets

Load manifest file failed for C:\Resources\{13b7ec61-6424-d4d3-9972-a83e58d8d6bb}\directory\f71b19461fcf494d89d3717b3a13cadf.something. WorkerRole.   DiagnosticStore\WAD0103\Configuration\EventSource_Manifest_fe06b63d-39aa-5419-0529-18c4dacf4f68_Ver_20.xml

Failed to manage manifest version for file C:\Resources\{13b7ec61-6424-d4d3-9972-a83e58d8d6bb}\directory\f71b19461fcf494d89d3717b3a13cadf. something.   WorkerRole.DiagnosticStore\WAD0103\Configuration\EventSource_Manifest_fe06b63d-39aa-5419-0529-18c4dacf4f68_Pid_3436.xml

Failed to process EventSource manifest event GUID:fe06b63d-39aa-5419-0529-18c4dacf4f68, event id:0xFFFE

Change in the number of events lost since the last sample: EventsCaptured=2 EventsLogged=1 EventsLost=0 

I do not use a manifest file and specify the EventSource via class / attribute name:

<EtwEventSourceProviderConfiguration scheduledTransferPeriod="PT3M" scheduledTransferLogLevelFilter="Information" provider="something.Core">
  <DefaultEvents eventDestination="CoreEvents" />
</EtwEventSourceProviderConfiguration>

I must be missing something, but I do not know what.

The remaining diagnostic services all work (infrastructure logs, performance counter etc.). The EventId that is being logged is the correct one, but all the important information of the log is missing, I suppose because of an incomplete configuration?


Edit: here is my EventSource code. I won't post the entire thing because it's quite large. I use another type that calls the EventSource methods and handles formatting of parameters (if the source is enabled in that level). Most method arguments are of type string, there are no objects or other complex types passed around (that handles the other type).

[EventSource(Name = "something.Core")]
public sealed class CoreEventSource : EventSource {
    private static readonly CoreEventSource SoleInstance = new CoreEventSource();

    static CoreEventSource() {}
    private CoreEventSource() {}

    public static CoreEventSource Instance {
        get { return SoleInstance; }
    }

    public static EventKeywords AllKeywords = (EventKeywords)(-1);

    public class Keywords {
        public const EventKeywords None = (EventKeywords)(1 << 1);

        public const EventKeywords Infrastructure = (EventKeywords)(1 << 2);
        [...]
    }

    public class Tasks {
        public const EventTask None = EventTask.None;

        // generic operations
        public const EventTask Create = (EventTask)11;
        public const EventTask Update = (EventTask)12;
        public const EventTask Delete = (EventTask)13;
        public const EventTask Get = (EventTask)14;
        public const EventTask Put = (EventTask)15;
        public const EventTask Remove = (EventTask)16;
        public const EventTask Process = (EventTask)17;
    }

    [Event(1, Message = "Initialization of {0} failed: {1}.", Level = EventLevel.Critical, Keywords = Keywords.Infrastructure)]
    public void CriticalInitializationFailure(string component, string details, string exception) {
        this.WriteEvent(1, component, details, exception);
    }

    [Event(2, Message = "[Role '{0}'] Startup: {1}", Level = EventLevel.Informational, Keywords = Keywords.Infrastructure)]
    public void RoleStartup(string roleName, string message) {
        this.WriteEvent(2, roleName, message);
    }

    [Event(3, Message = "[Role '{0}'] Stop failed: {1}.", Level = EventLevel.Error, Keywords = Keywords.Infrastructure)]
    public void RoleStopFailed(string roleName, string details, string exception) {
        this.WriteEvent(3, roleName, details, exception);
    }

    [Event(4, Message = "An unhandled exception occurred.", Level = EventLevel.Critical, Keywords = Keywords.Infrastructure)]
    public void UnhandledException(string exception) {
        this.WriteEvent(4, exception);
    }

    [Event(5, Message = "An unobserved exception occurred in a faulted task.", Level = EventLevel.Critical, Keywords = Keywords.Infrastructure)]
    public void UnobservedTaskException(string exception) {
        this.WriteEvent(5, exception);
    }
    [...]
}

Solution

  • Turns out there were quite a few problems with my EventSource. The first thing I'd recommend to anyone working with ETW is to use the Microsoft TraceEvent Library from NuGet, even if you use System.Diagnostics.Tracing, because it comes with a tool that will verify your EventSource code and notify you about problems.

    I had to fix the following:

    • EventSource names must not contain a period .
    • Task/Opcode pairs must be unique within an EventSource
    • One must not declare a None field in a custom Keywords or Tasks enumeration

    Hope this is of some use to anyone who encounters a similar problem.