amazon-web-services · go · aws-lambda · open-telemetry · aws-lambda-layers

OpenTelemetry Lambda Layer


Is there any way to reduce the events dropped by the Lambda layer? It keeps dropping the traces before they reach the central collector. Before it exports the traces, the layer fetches a token so it can send them to the central collector with authorization, but the traces never get pushed because they are dropped once the lambda function's execution is already done.

Lambda Extension Layer Reference: https://github.com/open-telemetry/opentelemetry-lambda/tree/main/collector

Exporter Error:

Exporting failed. No more retries left. Dropping data.
{
    "kind": "exporter",
    "data_type": "traces",
    "name": "otlp",
    "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "dropped_items": 8
}

Solution

  • I encountered the same problem and did some research. Unfortunately, it is a known issue that has not been resolved yet in the latest version of AWS Distro for OpenTelemetry Lambda (ADOT Lambda).

    GitHub issue tickets:

    The short answer: currently the OTel collector extension does not work reliably, because it gets frozen by the Lambda environment while it is still sending data to the exporters. As a workaround, you can send the traces directly to a collector running outside the Lambda container (sketched below).
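    A minimal sketch of that workaround, assuming the OpenTelemetry Go SDK with the OTLP gRPC exporter (the endpoint and names here are placeholders, not part of the original setup): the function exports straight to the external collector and flushes synchronously before the handler returns, so nothing is left queued when the container is frozen.

    ```go
    package main

    import (
        "context"

        "github.com/aws/aws-lambda-go/lambda"
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
    )

    var tp *sdktrace.TracerProvider

    func initTracer(ctx context.Context) error {
        // Export directly to a collector running outside the Lambda container.
        // The endpoint is a placeholder; point it at your central collector.
        exp, err := otlptracegrpc.New(ctx,
            otlptracegrpc.WithEndpoint("collector.example.internal:4317"),
            otlptracegrpc.WithInsecure(),
        )
        if err != nil {
            return err
        }
        tp = sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
        otel.SetTracerProvider(tp)
        return nil
    }

    func handler(ctx context.Context) (string, error) {
        // Deferred first, so it runs last: flush all queued spans before the
        // handler returns and the environment can freeze the container.
        defer tp.ForceFlush(ctx)

        _, span := otel.Tracer("example").Start(ctx, "handler")
        defer span.End()

        // ... business logic ...
        return "ok", nil
    }

    func main() {
        ctx := context.Background()
        if err := initTracer(ctx); err != nil {
            panic(err)
        }
        lambda.Start(handler)
    }
    ```

    If the central collector expects the token mentioned in the question, the function itself would have to attach it, for example via otlptracegrpc.WithHeaders; that part is an assumption about your setup.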

    The problem is:

    • the lambda sends the traces to the collector extension process during its execution
    • the collector queues them to send on to the configured exporters
    • the collector extension does not wait for the collector to finish processing its queue before telling the lambda environment that the extension is done; instead it always reports that it is done immediately, without looking at what the collector is doing (see the sketch after this list)
    • by the time the lambda is done, the extension has therefore already reported done, so the lambda container is frozen until the next lambda invocation
    • the container is thawed when the next invocation arrives. If that invocation comes soon and takes long enough, the collector may be able to finish sending the traces to the exporters; if not, the connection to the backend system times out before sending is complete
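
    For context, this is roughly what the extension's lifecycle loop looks like against the Lambda Extensions API (a simplified sketch; error handling and event parsing are omitted). The key point is that calling /event/next is what signals "done" for the current invocation, independent of the collector's export queue:

    ```go
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "os"
    )

    func main() {
        base := fmt.Sprintf("http://%s/2020-01-01/extension",
            os.Getenv("AWS_LAMBDA_RUNTIME_API"))

        // Register the extension for INVOKE and SHUTDOWN events.
        reg, _ := http.NewRequest(http.MethodPost, base+"/register",
            bytes.NewBufferString(`{"events":["INVOKE","SHUTDOWN"]}`))
        reg.Header.Set("Lambda-Extension-Name", "collector")
        resp, err := http.DefaultClient.Do(reg)
        if err != nil {
            panic(err)
        }
        id := resp.Header.Get("Lambda-Extension-Identifier")
        resp.Body.Close()

        for {
            // Asking for the next event tells the environment that this
            // extension is done with the current invocation. If the collector
            // still has spans queued at this point, the container may be
            // frozen before they are exported, which is the behaviour
            // described in the list above.
            next, _ := http.NewRequest(http.MethodGet, base+"/event/next", nil)
            next.Header.Set("Lambda-Extension-Identifier", id)
            ev, err := http.DefaultClient.Do(next)
            if err != nil {
                panic(err)
            }
            ev.Body.Close()
            // ... dispatch on INVOKE / SHUTDOWN ...
        }
    }
    ```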

    What complicates the solution is that it is very hard for an extension to detect whether the main lambda has finished processing.

    Ideally, a telemetry extension would:

    1. Wait for the lambda to finish processing
    2. Check if the lambda sent it any data to process and forward
    3. Wait for all processing and forwarding to complete (if any)
    4. Signal to the lambda environment that the extension is done

    The lambda extension protocol doesn't tell the extension when the main lambda has finished processing (it would be great if AWS could add that to the extension protocol as a new event type).

    There is a proposed PR that tries to work around this by assuming that lambdas always send traces, so instead of waiting for the lambda to complete, it waits for a TCP request to the OTLP receiver to arrive. This works, but it makes the extension hang forever if the lambda never sends any traces.
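
    A rough illustration of that approach (the names are hypothetical, not taken from the PR): the extension blocks on a signal from its own OTLP receiver before calling /event/next, which is also exactly why it hangs when the lambda never sends traces.

    ```go
    package extension

    // Hypothetical sketch of the PR's idea, not its actual code.
    var firstTrace = make(chan struct{}, 1)

    // notifyTraceArrived would be called by the extension's OTLP receiver
    // when a trace request arrives from the function.
    func notifyTraceArrived() {
        select {
        case firstTrace <- struct{}{}:
        default: // already signalled
        }
    }

    // waitForFirstTrace would run before requesting /event/next: it blocks
    // until the function has sent at least one trace. If the function never
    // sends traces, it blocks forever and the extension never signals "done".
    func waitForFirstTrace() {
        <-firstTrace
    }
    ```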

    Note: the same problem that we see here for traces also exists for metrics.