Avoid infinite PubSub loop when Cloud Run returns an error


I'm building an architecture for processing files. A script running on Cloud Functions downloads a file, and then a Cloud Run application processes it. Until now, Cloud Functions and Cloud Run communicated through direct HTTP requests. The disadvantage is that the Cloud Function has to wait until the file processing has completed. I thought about putting PubSub in between: the Cloud Function publishes, and Cloud Run subscribes. However, I'm facing undesired behavior when Cloud Run returns a non-200 status code: the PubSub message keeps being resent to Cloud Run in an infinite loop.

This is how I created the PubSub topic and the Cloud Run subscription:

resource "google_pubsub_topic" "schema_validator_trigger" {
  name   = "oney-schema-validator-trigger"
  labels = var.tags
}

resource "google_pubsub_subscription" "schema_validator_subscription" {
  name  = "oney_schema_validator_subscription"
  topic = google_pubsub_topic.schema_validator_trigger.name

  ack_deadline_seconds = 600

  expiration_policy {
    ttl = ""
  }

  push_config {
    push_endpoint = "https://my-cloudrun-url.a.run.app/my/endpoint"
    oidc_token {
      service_account_email = google_service_account.downloader_sa.email
    }
    attributes = {
      x-goog-version = "v1"
    }
  }
}

Inside Cloud Run, I'm applying certain validations to check that the received PubSub message has the proper format:

from flask import Flask, request

app = Flask(__name__)

@app.route("/my/endpoint", methods=["POST"])
def my_endpoint():
    # Check received PubSub message
    envelope = request.get_json()

    expected_params = ["bucketname", "filename"]
    msg, status_code = utils.requests.validate_pubsub_message(envelope, expected_params)
    if status_code != 200:
        utils.logging.log_message(msg, severity="ERROR")
        return msg, status_code

    ...

This is where the problem arises. If I return a 400 status code, the message keeps being resent to the endpoint no matter what, and it won't stop until I purge it from the topic.

Desired behavior: if my Cloud Run service returns a non-200 status code, the problem is with the request, so retrying it will not help. Even if my server crashes and returns a 500 status code, I would still be notified through logs and alerting; I would then work on a fix and resend the PubSub message once the fix is deployed. I want the PubSub message to be ACKed as soon as the Cloud Run application responds, whatever the status code. Is this possible, or am I making a design error?


Solution

Returning a non-success status code tells Pub/Sub that the message should be redelivered. For a push subscription, only 102, 200, 201, 202, and 204 count as an acknowledgement; any other status code, or no response before the ack deadline, is treated as a failure. For transient failures, whether caused by something in the transport layer or by the end application itself, this response is how the service knows that redelivery is needed.

If you have a permanent failure for which you do not want the message to be redelivered, return a success status such as 200 so it is not redelivered. If you are able to detect that these messages are bad, you might consider handling them by publishing them to another topic that acts as a dead letter topic, so you can examine the requests that failed. You can also set up this behavior within Pub/Sub itself (in Terraform, the dead_letter_policy block of google_pubsub_subscription), so that after a configured number of failed delivery attempts the message is moved to another topic. Note that the minimum number of delivery attempts is 5.
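As a rough illustration of the advice above, here is a minimal sketch of a handler that acknowledges permanent failures and only asks for redelivery on failures it considers transient. The names `handle_push`, `TransientError`, and the `process` callback are hypothetical and not part of the question's code; the sketch also assumes the publisher sends a JSON object in the message data, which the question does not state.

```python
import base64
import json
import logging

ACK_STATUS = 204    # any of 102, 200, 201, 202, 204 acknowledges a push message
RETRY_STATUS = 500  # any other status makes Pub/Sub redeliver the message


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def handle_push(envelope, expected_params, process):
    """Return a (body, status) tuple for one Pub/Sub push request.

    `process` is a hypothetical callback that does the real work.
    """
    try:
        # Push envelopes carry the payload base64-encoded in message.data.
        raw = base64.b64decode(envelope["message"]["data"])
        payload = json.loads(raw)  # assumption: the payload is a JSON object
        if not isinstance(payload, dict):
            raise ValueError("payload is not a JSON object")
        missing = [p for p in expected_params if p not in payload]
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed message: retrying cannot fix it, so log and acknowledge.
        logging.error("Dropping malformed Pub/Sub message: %s", exc)
        return "", ACK_STATUS

    if missing:
        logging.error("Dropping message, missing parameters: %s", missing)
        return "", ACK_STATUS

    try:
        process(payload)
    except TransientError as exc:
        # Only this case asks Pub/Sub to send the message again.
        logging.warning("Transient failure, requesting redelivery: %s", exc)
        return "retry", RETRY_STATUS
    except Exception:
        # Permanent failure: log for alerting, then acknowledge.
        logging.exception("Permanent failure, acknowledging message")
        return "", ACK_STATUS
    return "", ACK_STATUS
```

In the Flask route from the question, this could be wired in as `return handle_push(request.get_json(silent=True), ["bucketname", "filename"], process_file)`, where `process_file` stands in for the existing processing code. Because every permanent failure is acknowledged, the logs become the only record of dropped messages, so the alerting described in the question has to be in place.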