google-cloud-platform, stackdriver, google-cloud-run

Stackdriver Trace with Google Cloud Run


I have been working on a Stackdriver Trace integration on Google Cloud Run. I can get it to work with the agent, but a few questions still bother me.

Here is what I have found so far:

  • The Stackdriver agent aggregates traces in a small buffer and sends them periodically.
  • CPU access is restricted when a Cloud Run service is not handling a request.
  • There is no shutdown hook for Cloud Run services, so you can't flush the buffer before shutdown: the container just gets a SIGKILL, a signal your application can't catch.
  • Running a background process that sends information outside of the request-response cycle seems to violate the Knative Container Runtime contract.
  • The collection of logging data is documented and does not require me to run an agent, but there is no such solution for telemetry.
  • I found one report of someone experiencing lost traces on Cloud Run using the agent-based approach.
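
To make the concern concrete, here is a minimal sketch of the buffer-and-send-periodically pattern described in the list above. It is not the Stackdriver agent's code: `BufferedExporter`, `send`, `interval_s` and `clock` are hypothetical names chosen for illustration, and the flush check happens on `record()` rather than on a background timer to keep the example small.

```python
import time
from typing import Callable, List


class BufferedExporter:
    """Hypothetical exporter: keeps spans in memory and sends them in batches."""

    def __init__(
        self,
        send: Callable[[List[str]], None],
        interval_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send              # stand-in for the real network call
        self._interval_s = interval_s  # assumed flush interval
        self._clock = clock
        self._pending: List[str] = []
        self._last_flush = clock()

    def record(self, span: str) -> None:
        self._pending.append(span)
        # Spans are only sent once the interval has elapsed; until then
        # they exist solely in this process's memory.
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._send(self._pending)
            self._pending = []
        self._last_flush = self._clock()

    @property
    def pending(self) -> int:
        return len(self._pending)
```

Anything still counted by `pending` when the container receives SIGKILL is never sent, which is the loss scenario I am worried about.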

How Google does it

I looked at the source code for the Cloud Endpoints ESP (its Cloud Run integration is in beta) to see whether it solves this differently, but it uses the same pattern: traces go into a buffer (1s) that is cleared periodically.

Question

While my tracing integration seems to work in my test setup, I am worried about incomplete and missing traces when I run this in a production environment.

  • Is this a hypothetical problem or a real issue?

  • It looks like the right way to approach this is to write telemetry to logs, instead of using an agent process. Is that supported with Stackdriver Trace?


Solution

  • Cloud Run now supports sending SIGTERM. If your application handles SIGTERM, it gets a 10-second grace period before shutdown.

    You can use the 10 seconds to:

    • Flush buffers that have unsent data
    • Close connections to other systems

    Docs: Container runtime contract
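
A minimal sketch of flushing on SIGTERM, using only Python's standard `signal` module. `SpanBuffer`, `make_sigterm_handler` and `install_sigterm_handler` are hypothetical names for illustration, and the "send" step is a placeholder. With a real tracing client you would call whatever flush or shutdown method that client provides instead.

```python
import signal
import sys
import threading
from typing import Callable, List


class SpanBuffer:
    """Hypothetical in-memory buffer standing in for a tracing client's queue."""

    def __init__(self) -> None:
        # RLock so the handler can re-enter if the signal interrupts add()
        # on the main thread, where Python runs signal handlers.
        self._lock = threading.RLock()
        self._pending: List[str] = []
        self.sent: List[str] = []

    def add(self, span: str) -> None:
        with self._lock:
            self._pending.append(span)

    def flush(self) -> int:
        """Send everything still pending; returns the number of spans sent."""
        with self._lock:
            batch, self._pending = self._pending, []
        self.sent.extend(batch)  # placeholder for the real network call
        return len(batch)


def make_sigterm_handler(buffer: SpanBuffer) -> Callable:
    def handle_sigterm(signum, frame) -> None:
        # Flush unsent data within the grace period, then exit cleanly.
        buffer.flush()
        sys.exit(0)

    return handle_sigterm


def install_sigterm_handler(buffer: SpanBuffer) -> None:
    # Must be called from the main thread, e.g. once at application startup.
    signal.signal(signal.SIGTERM, make_sigterm_handler(buffer))
```

Call `install_sigterm_handler(buffer)` once at startup. The flush has to finish well within the 10 seconds, so a real send should use a short network timeout.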