apache-kafka aws-lambda trace aws-xray

AWS XRay not showing traces for lambda with Kafka event source


I had the following infra set up:

SelfManagedKafkaEventSource --> Lambda#A* --> ApiGateway#B* --> Lambda#B*

resource* : resource with XRay enabled/instrumented

First, I enabled XRay and instrumented the code for all three of Lambda#A, ApiGateway#B and Lambda#B, because Lambda#A calls some external clients and I want traces for those calls. When testing manually (using the lambda console) in my dev environment, WITHOUT the KafkaEventSource, everything looked fine, i.e. I was getting traces for every call.

Then, in my prod env, WITH SelfManagedKafkaEventSource, XRay stopped working. That was the moment I realized that X-Ray tracing is currently not supported for Lambda functions with Amazon Managed Streaming for Apache Kafka. The weird thing is that it stopped working even for Lambda#B, which is NOT connected to the KafkaEventSource directly! It only "worked" when I tested it manually (through the lambda console).

Anyway, I then thought about adding a KafkaProxy function with XRay DISABLED and directly connected to the KafkaEventSource, so the architecture would look like this:

SelfManagedKafkaEventSource --> KafkaProxyLambda --> Lambda#A* --> ApiGateway#B* --> Lambda#B*

But I still didn't get traces of any kind, UNLESS I tested the KafkaProxyLambda manually through the lambda console.

Finally, the only solution I could find was to add ANOTHER api gateway in front of Lambda#A:

SelfManagedKafkaEventSource --> KafkaProxyLambda --> ApiGateway#A --> Lambda#A* --> ApiGateway#B* --> Lambda#B*

Has anyone dealt with this before? Am I overlooking something? Is there any other way to get traces in this case?


Solution

  • When using the Kafka source, your lambda function receives a trace context in which the sampling decision has already been made, and that decision is not to sample. This is why the trace doesn't show up.

    This trace context (with the negative sampling decision) is propagated to downstream services both by the Lambda#A function (because it's instrumented) and by the KafkaProxyLambda function (because the aws sdk picks up the context from Lambda even without instrumentation). The last setup works because the uninstrumented HTTP library in KafkaProxyLambda does not propagate the context, so ApiGateway#A doesn't receive any trace header and its own ActiveTracing configuration is applied.

    You may be able to make the 2nd setup (with KafkaProxyLambda --> Lambda#A*) work if you manually overwrite the tracing context in the KafkaProxyLambda function and generate a new trace header.
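
As a rough sketch of that last suggestion, the snippet below builds a fresh trace header value with its own root and a positive sampling decision, and checks the sampling flag of an existing one. It follows the documented X-Ray tracing header layout (`Root=1-<8 hex digits of epoch time>-<24 hex digits>;Sampled=1`). The function names `new_trace_header` and `is_sampled` are made up for illustration, and how the value gets attached to the outgoing call from KafkaProxyLambda is an assumption that depends on the client you use.

```python
import secrets
import time


def new_trace_header(sampled: bool = True) -> str:
    """Build a fresh X-Amzn-Trace-Id value (hypothetical helper).

    A new Root means the downstream service starts a new trace instead of
    inheriting the "do not sample" decision from the Kafka-triggered context.
    """
    epoch_hex = format(int(time.time()), "08x")  # 8 hex digits: start time in epoch seconds
    unique_hex = secrets.token_hex(12)           # 24 hex digits: random identifier
    return f"Root=1-{epoch_hex}-{unique_hex};Sampled={1 if sampled else 0}"


def is_sampled(trace_header: str) -> bool:
    """Return True if the header carries Sampled=1 (hypothetical helper)."""
    fields = dict(
        part.split("=", 1) for part in trace_header.split(";") if "=" in part
    )
    return fields.get("Sampled") == "1"
```

Inside a Lambda function the incoming context is exposed through the `_X_AMZN_TRACE_ID` environment variable, so `is_sampled(os.environ.get("_X_AMZN_TRACE_ID", ""))` would show whether the Kafka-triggered invocation arrived with `Sampled=0`. The new value would then need to be sent as the `X-Amzn-Trace-Id` header on the request to Lambda#A. Whether your SDK or HTTP client lets you override that header is something to verify for your setup.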