java spring rabbitmq spring-rabbit zipkin

In instrumentation-spring-rabbit, why does brave remove the headers of the message?


In the instrumentation-spring-rabbit module, brave extracts the tracing headers and then removes them from the message. Why?

I explored other instrumentation (spring-web, httpclient, okhttp3, grpc, and others), and there brave never removes the headers, which hold the tracing keys/extras, from the original message.

Removing the headers has a side effect: when the retry interceptor (already added by spring-rabbit) processes the message a second time, brave has already removed the headers during the first attempt, so it won't find them on subsequent retries.
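
The side effect described above can be reproduced in a simplified model. This is only a sketch in plain Java, not brave's or spring-rabbit's actual code: `extractAndClear` and `deliverWithRetry` are hypothetical stand-ins for the tracing interceptor and the retry interceptor, and the message is reduced to a mutable header map. The header names are the standard B3 propagation keys.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RetryHeaderLoss {
    // B3 propagation header names used by Zipkin/Brave.
    static final List<String> B3_KEYS =
        List.of("X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled");

    // Hypothetical stand-in for the tracing interceptor: reads the trace ID,
    // then strips every B3 header from the message's mutable header map.
    public static String extractAndClear(Map<String, Object> headers) {
        Object traceId = headers.get("X-B3-TraceId");
        B3_KEYS.forEach(headers::remove);
        return traceId == null ? null : traceId.toString();
    }

    // Hypothetical stand-in for a retry interceptor: the same message object
    // is handed to the traced listener on every attempt.
    public static String[] deliverWithRetry(Map<String, Object> headers, int attempts) {
        String[] seen = new String[attempts];
        for (int i = 0; i < attempts; i++) {
            seen[i] = extractAndClear(headers);
            // Assume the listener fails here, which triggers the next attempt.
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new HashMap<>();
        headers.put("X-B3-TraceId", "463ac35c9f6413ad");
        headers.put("X-B3-SpanId", "a2fb4a1d1a96d312");
        headers.put("content-type", "application/json");

        String[] seen = deliverWithRetry(headers, 2);
        System.out.println("attempt 1 trace ID: " + seen[0]);
        System.out.println("attempt 2 trace ID: " + seen[1]);
        System.out.println("remaining headers: " + headers);
    }
}
```

In this model the first attempt sees the trace ID and the second attempt sees none, because both attempts share one header map that the first attempt already stripped. Non-tracing headers such as `content-type` are left in place.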


Solution

  • Messaging tracing differs from typical RPC tracing in two major ways. Because of that, comparing it to RPC isn't the best way to figure out a road ahead. I'll briefly mention a couple of things here; most of this is in the slide deck I made on the topic.

    1. In messaging, there is often no thread context passed between the consumer and the message processor. This is unlike RPC where there's usually a handoff on at least the request side.
    2. When we have a thread context, we should use it to establish parent information (this is the case in the rabbit processing). However, a thread context often isn't available. So, when we don't know the message processing abstraction, we often re-serialize headers onto the message.

    In your example, you are talking about spring-rabbit, which uses thread context during the processing block to set the "current span" appropriately. As we don't want to confuse thread-based context with what's in the message, we clear the headers.
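
As a rough illustration of that hand-off, here is a minimal sketch in plain Java. It is not the real instrumentation: `CURRENT_TRACE_ID` is a hypothetical stand-in for the tracer's thread-bound "current span", and `process` is a hypothetical listener wrapper. It only shows the idea that the trace ID moves from the message header into thread context, so the processor has exactly one place to look for its parent.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ThreadContextSketch {
    // Hypothetical stand-in for the tracer's thread-bound "current span".
    static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    // Models the listener wrapper: move the trace ID from the message headers
    // into thread context, clearing the header, then run the processor in scope.
    public static <T> T process(Map<String, Object> headers, Supplier<T> processor) {
        Object traceId = headers.remove("X-B3-TraceId");
        CURRENT_TRACE_ID.set(traceId == null ? null : traceId.toString());
        try {
            return processor.get();
        } finally {
            // Clear the scope so context never leaks to the next message.
            CURRENT_TRACE_ID.remove();
        }
    }

    public static void main(String[] args) {
        Map<String, Object> headers = new HashMap<>();
        headers.put("X-B3-TraceId", "463ac35c9f6413ad");

        // The processor reads its parent from thread context, not the message.
        String parent = process(headers, () -> CURRENT_TRACE_ID.get());

        System.out.println("parent from thread context: " + parent);
        System.out.println("header still on message: " + headers.containsKey("X-B3-TraceId"));
    }
}
```

Here the processor finds its parent in thread context, and the message no longer carries the header. That avoids two competing sources of parent information, at the cost that a later redelivery of the same message object has nothing left to extract.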

    The "retry" case indeed puts this into question. What should be the parent in that case, and how would it be known? One of the issues with the instrumentation in question is that we don't actually see the code that consumed the message.

    Concretely, there is no instrumentation of the RabbitMQ poll, so we add a "fake consumer span" to retroactively account for it. If the message were replayed, perhaps a second consumer span would be valid. Frankly, we didn't consider this.

    Anyway, my point is that we should not focus too much on the differences between messaging tracing and RPC, as some of them are intentional. Let's focus on the gap itself, probably on Gitter, which I think would lead to a GitHub issue.

    Anyway, I hope the context answers your question, even if it doesn't change the fact that the code currently does what it does.