Tags: google-cloud-run, stackdriver, google-cloud-stackdriver, open-telemetry, google-cloud-trace

Trace Propagation on Google Cloud Run with OpenTelemetry


I have a Flask app talking to a Python gRPC service, both deployed on Google Cloud Run. After instrumenting the apps I can see traces in Google Trace, but they all have different trace IDs, which means the traces are not being linked together between the two services. This is the tracing setup code I use on both services, with the gRPC and Flask instrumentors set up on each side:

import logging
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleExportSpanProcessor
from opentelemetry.propagators import set_global_textmap
from opentelemetry.tools.cloud_trace_propagator import CloudTraceFormatPropagator
from google.auth.exceptions import DefaultCredentialsError

logger = logging.getLogger(__name__)

def setup_tracing():
    """
    Setup Tracing on Google Cloud. The Service Account Roles must have `Cloud Trace Agent`
    Role added for traces to be ingested.
    """

    trace.set_tracer_provider(TracerProvider())
    try:
        # If running on Google Cloud, will use instance metadata service account credentials to initialize
        trace.get_tracer_provider().add_span_processor(
            SimpleExportSpanProcessor(CloudTraceSpanExporter())
        )
        # Using the X-Cloud-Trace-Context header
        set_global_textmap(CloudTraceFormatPropagator())

        logger.info("Tracing Setup. Exporting Traces to Google Cloud.")
    except DefaultCredentialsError:
        # Not running on Google Cloud so will use console exporter
        from opentelemetry.sdk.trace.export import ConsoleSpanExporter
        trace.get_tracer_provider().add_span_processor(
            SimpleExportSpanProcessor(ConsoleSpanExporter())
        )
        logger.info("Tracing Setup. Exporting Traces to Console.")

Locally, the ConsoleSpanExporter shows that the trace IDs on both services match. On Google Cloud Run they clearly don't, which results in separate traces in Google Trace. Does the networking strip the headers between services, or is something else preventing the trace ID from being propagated?

As an extra note, I've also noticed that the trace/span IDs from the load balancer in front of Cloud Run aren't being propagated when using CloudTraceFormatPropagator(). This makes my logs messy too, because the logs for a single request aren't nested together.


Solution

  • After hours of debugging, it turned out to be bad documentation on the Python gRPC client instrumentation. For insecure (localhost) channels, the documented setup works and the client is instrumented. For secure channels (as required on Google Cloud Run) you need to pass in channel_type='secure'. I'm not sure why it was designed this way, so I raised an issue on the module: https://github.com/open-telemetry/opentelemetry-python-contrib/issues/365

    In addition, you need to use the X-Cloud-Trace-Context header so that your traces use the same trace ID as the load balancer and AppServer on Google Cloud Run and all link up in Google Trace. However, the default implementation of their propagator uses uppercase letters in the header name, which are not allowed in gRPC metadata keys, so it throws a validation error. I took the class below, made the header key all lowercase, and it all works perfectly now:

    https://github.com/GoogleCloudPlatform/opentelemetry-operations-python/blob/master/opentelemetry-tools-google-cloud/src/opentelemetry/tools/cloud_trace_propagator.py
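To illustrate what the lowercase fix amounts to, here is a minimal standard-library sketch of writing and reading the header with an all-lowercase key. It is not the linked propagator class and does not implement the OpenTelemetry propagator interface; the helper names `inject_cloud_trace_header` and `extract_cloud_trace_header` are hypothetical. It assumes Google's documented header layout, `TRACE_ID/SPAN_ID;o=OPTIONS`, with a 32-character hex trace ID and a decimal span ID.

```python
import re
from typing import Dict, Optional, Tuple

# gRPC metadata keys must be lowercase, so the key is spelled that way here.
HEADER_KEY = "x-cloud-trace-context"

# Assumed layout: TRACE_ID/SPAN_ID;o=OPTIONS
# TRACE_ID is 32 hex characters, SPAN_ID is a decimal integer.
_HEADER_RE = re.compile(r"^([0-9a-fA-F]{32})/(\d{1,20})(?:;o=(\d+))?$")


def inject_cloud_trace_header(
    carrier: Dict[str, str], trace_id: int, span_id: int, sampled: bool
) -> None:
    """Write the trace context into the carrier under the lowercase key."""
    carrier[HEADER_KEY] = f"{trace_id:032x}/{span_id};o={int(sampled)}"


def extract_cloud_trace_header(
    carrier: Dict[str, str],
) -> Optional[Tuple[int, int, bool]]:
    """Return (trace_id, span_id, sampled), or None if absent or malformed."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    lowered = {key.lower(): value for key, value in carrier.items()}
    value = lowered.get(HEADER_KEY)
    if value is None:
        return None
    match = _HEADER_RE.match(value.strip())
    if match is None:
        return None
    trace_id = int(match.group(1), 16)
    span_id = int(match.group(2))
    sampled = bool(int(match.group(3) or "0") & 1)
    return trace_id, span_id, sampled
```

Reading with a case-insensitive lookup while always writing the lowercase key means the same helper can accept the header from the load balancer over HTTP and still produce a key that passes gRPC metadata validation.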

    Finally, I had a long-standing issue with linking my logs to traces in Google Cloud logs. The documentation says to use a hex trace ID and a hex span ID, but they didn't work because I was using the wrong OpenTelemetry functions to format them. This code works, and I can now see my logs alongside my traces in Google Trace's Trace List view:

    from opentelemetry import trace
    from opentelemetry.trace.span import get_hexadecimal_trace_id, get_hexadecimal_span_id
    
            current_span = trace.get_current_span()
            if current_span:
                trace_id = current_span.get_span_context().trace_id
                span_id = current_span.get_span_context().span_id
                if trace_id and span_id:
                    logging_fields['logging.googleapis.com/trace'] = f"projects/{self.gce_project}/traces/{get_hexadecimal_trace_id(trace_id)}"
                    logging_fields['logging.googleapis.com/spanId'] = f"{get_hexadecimal_span_id(span_id)}"
                    logging_fields['logging.googleapis.com/trace_sampled'] = True
    

    It took a while, but I guess it's my fault for picking an Alpha (just turned Beta) framework (OpenTelemetry) on a new Google Cloud service that is not very well documented in this area. With those fixes it all works now, and it is much easier to debug issues and see the whole request end to end!
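The log-linking snippet above relies on OpenTelemetry helper functions whose names may differ between releases, so here is a version-independent sketch that uses only plain string formatting. It assumes the trace ID should be rendered as 32 lowercase hex characters and the span ID as 16, which is my understanding of what those helpers produce. The function name `cloud_logging_trace_fields` and the `project_id` parameter are hypothetical stand-ins for the surrounding class and `self.gce_project` in the original.

```python
import json
from typing import Any, Dict


def cloud_logging_trace_fields(
    project_id: str, trace_id: int, span_id: int, sampled: bool = True
) -> Dict[str, Any]:
    """Build the special fields that tie a log entry to a trace and span."""
    # Mirror the original guard: skip the fields when either ID is missing.
    if not trace_id or not span_id:
        return {}
    return {
        # 128-bit trace ID as 32 zero-padded hex characters.
        "logging.googleapis.com/trace": (
            f"projects/{project_id}/traces/{trace_id:032x}"
        ),
        # 64-bit span ID as 16 zero-padded hex characters.
        "logging.googleapis.com/spanId": f"{span_id:016x}",
        "logging.googleapis.com/trace_sampled": sampled,
    }


def structured_log_line(message: str, fields: Dict[str, Any]) -> str:
    """Render one JSON log line carrying the message and the trace fields."""
    return json.dumps({"message": message, **fields})
```

In use, the integer IDs would come from `trace.get_current_span().get_span_context()` as in the snippet above, and the resulting JSON line would be written to stdout for Cloud Run to pick up as a structured log entry.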