
Distributed correlation in Application Insights end-to-end transactions: are there limits on nesting levels?


I have five microservices (running in Docker containers on our local server) that all send their telemetry to Application Insights in Azure. They communicate over RabbitMQ, and this communication has been wired up to send operation IDs in the message headers so that each downstream service can set its context's parent operation ID to the right value. I thought this would be enough for AI to correlate the requests and paint us the complete picture of events. It works only partially, and I am trying to understand why.

The first incoming request hits service A, which then sends messages to services B and C; each of them sets its parent operation ID to A's operation ID. Service C then activates service D, and service D activates service E, and the same mechanism sets the respective parent operation IDs. So the operations are linked like this:

A.OperationID <- gets initialized
A.ParentOperationID <- (gets set == A.OperationID automatically?)

B.OperationID <- gets initialized
B.ParentOperationID <- A.OperationID

C.OperationID <- gets initialized
C.ParentOperationID <- A.OperationID

D.OperationID <- gets initialized
D.ParentOperationID <- C.OperationID

E.OperationID <- gets initialized
E.ParentOperationID <- D.OperationID

I get all 5 requests in Transaction Search and I can see that all the IDs are set according to the description above, so everything is seemingly correct. I then expect AI to correlate all 5 requests into the same end-to-end transaction, but what happens instead is that only A, B and C are joined together, while D and E are separate and correlated neither with the first three nor with each other.

I see that A, B and C all have the same ParentID, and I am starting to suspect that I was wrong to think AI would group requests transitively, across several levels of parent-child relationships. Can anybody confirm this suspicion, or tell me what I am misunderstanding here?

The code that manages requests telemetry and sets the relationships looks something like this:

_listenerService.StartListening(async (input, operationId) => //operationId is provided by messaging framework here
{
    using (var operation = _telemetryClient.StartOperation<RequestTelemetry>(nameof(ThisServiceWorker)))
    {
        operation.Telemetry.Context.Operation.ParentId = operationId;
        operation.Telemetry.Properties.Add("CustomProperty", input.ImportantValue);
                    
        await _executiveService.DoWorkAndSendMessageDownTheLine(input, operation.Telemetry.Context.Operation.Id, stoppingToken);
    }
});

Solution

  • Although there is still a lot I don't understand about the AI SDK, I think I have resolved the problem that I asked about here:

    There are three IDs on each telemetry event, not two: Operation ID, Parent ID and Telemetry ID. In the telemetry objects they are represented, respectively, by:

    myOperation.Telemetry.Context.Operation.Id
    myOperation.Telemetry.Context.Operation.ParentId
    myOperation.Telemetry.Id
    

    What I did not understand was that Telemetry.Context.Operation.ParentId refers to Telemetry.Id (and NOT to Telemetry.Context.Operation.Id).
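
    The ParentId -> Telemetry.Id link can be illustrated without the SDK. The sketch below is only a model: the RequestRecord type, the AncestorChain helper and all ID values are made up for illustration and are not part of Application Insights. It resolves each request's parent by looking up ParentId among the Id values, which is the relationship described above.

    ```csharp
    #nullable enable
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical record holding only the three correlation fields of one request.
    public sealed record RequestRecord(string Name, string OperationId, string? ParentId, string Id);

    public static class CorrelationModel
    {
        // Follows ParentId -> Id links upward and returns the request names, root first.
        public static List<string> AncestorChain(IReadOnlyList<RequestRecord> records, string name)
        {
            var byId = records.ToDictionary(r => r.Id);
            var current = records.Single(r => r.Name == name);
            var chain = new List<string> { current.Name };
            while (current.ParentId != null && byId.TryGetValue(current.ParentId, out var parent))
            {
                chain.Insert(0, parent.Name);
                current = parent;
            }
            return chain;
        }

        // Made-up IDs for the A..E topology from the question:
        // one shared OperationId, and each ParentId equal to the caller's Id.
        public static List<RequestRecord> Sample() => new List<RequestRecord>
        {
            new RequestRecord("A", "op-1", null,    "req-A"),
            new RequestRecord("B", "op-1", "req-A", "req-B"),
            new RequestRecord("C", "op-1", "req-A", "req-C"),
            new RequestRecord("D", "op-1", "req-C", "req-D"),
            new RequestRecord("E", "op-1", "req-D", "req-E"),
        };
    }
    ```

    With this sample, AncestorChain(Sample(), "E") yields A, C, D, E. If D's ParentId held an operation ID such as "op-1" instead of "req-C", the lookup would find no matching Id and the chain for D would stop at D itself.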

    I have the impression that the algorithm that builds the end-to-end transaction in Application Insights' Transaction Search uses both the OperationId and the ParentId -> Telemetry.Id relationship to group events together. I might be wrong here, but I think it is enough for the events either to have the same OperationId or to have a child-parent relationship with each other to be put into the same timeline (in the latter case both operation IDs are listed in the transaction's header).
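
    To cover both conditions at once, a handler could carry two values in the message headers: the shared operation ID and the sender's Telemetry.Id. The following is a minimal sketch, assuming the Microsoft.ApplicationInsights NuGet package and its StartOperation overload that accepts an operation ID and a parent ID. The CorrelationHeaders record and the CorrelatedHandler class are hypothetical names introduced here; they are not part of the SDK or of the code in the question.

    ```csharp
    using System;
    using System.Threading.Tasks;
    using Microsoft.ApplicationInsights;
    using Microsoft.ApplicationInsights.DataContracts;

    // Hypothetical pair of values carried in the message headers.
    public sealed record CorrelationHeaders(string OperationId, string ParentTelemetryId);

    public static class CorrelatedHandler
    {
        // Starts a request that joins the sender's operation, runs the work,
        // and hands the work the headers to attach to any outgoing message.
        public static async Task<CorrelationHeaders> HandleAsync(
            TelemetryClient telemetryClient,
            string requestName,
            CorrelationHeaders incoming,
            Func<CorrelationHeaders, Task> doWorkAndSendDownTheLine)
        {
            using (var operation = telemetryClient.StartOperation<RequestTelemetry>(
                requestName, incoming.OperationId, incoming.ParentTelemetryId))
            {
                // The next hop's parent is THIS request's Telemetry.Id,
                // while the operation ID is passed along unchanged.
                var outgoing = new CorrelationHeaders(
                    operation.Telemetry.Context.Operation.Id,
                    operation.Telemetry.Id);

                await doWorkAndSendDownTheLine(outgoing);
                return outgoing;
            }
        }
    }
    ```

    The example passes the operation ID as a 32-character lowercase hex string in mind; how IDs in other formats are treated may differ between SDK versions and is worth checking against the version in use.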

    I would still like to know what the "industry approved" way of correlating remote services via a message bus is. I guess I would need to build a test example that uses the Azure Service Bus and study what that results in when everything is instrumented natively.
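
    One convention that may be relevant here is W3C Trace Context, where a single traceparent value carries both the trace ID (comparable to the operation ID) and the sender's span ID (comparable to Telemetry.Id). The sketch below uses only System.Diagnostics.Activity from .NET and does not involve Application Insights or any message bus. The TraceContextSketch class and StartHop method are hypothetical names, and whether a natively instrumented Service Bus setup behaves this way is an assumption that the test example mentioned above would have to confirm.

    ```csharp
    #nullable enable
    using System.Diagnostics;

    public static class TraceContextSketch
    {
        // Starts an Activity for one hop. A null traceparent starts a new trace;
        // otherwise the new Activity joins the sender's trace as its child.
        public static Activity StartHop(string name, string? traceparent)
        {
            var activity = new Activity(name);
            activity.SetIdFormat(ActivityIdFormat.W3C);
            if (traceparent != null)
            {
                activity.SetParentId(traceparent);
            }
            return activity.Start();
        }
    }
    ```

    In use, the sender would put its Activity's Id (in W3C format this is the traceparent string) into a message header, and the receiver would pass that header value to StartHop. The receiver's TraceId then equals the sender's TraceId, and its ParentSpanId equals the sender's SpanId.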