Search code examples
google-cloud-platformgoogle-cloud-run

Cloud run deploy taking too long


Since 2024 August 14th, our deploy started to take too long to complete. We didn't change anything in the cloudbuild.yaml, but the process started to take too long. Here are the build steps we are using:

steps:
  - name: gcr.io/cloud-builders/mvn
    args: [ '-e', '-q', 'package']
  - name: gcr.io/cloud-builders/docker
    args: ['build', '.',
           '-t','$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME-$COMMIT_SHA',
           '-f','./Docker/gcp/Dockerfile']
  - name: gcr.io/cloud-builders/docker
    args: [ 'push', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME-$COMMIT_SHA']
  - name: gcr.io/cloud-builders/gcloud
    args: ['run', 'deploy', '$_SCHEDULED_SERVICE',
           '--image', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME-$COMMIT_SHA',
           '--region', '$_DEPLOY_REGION',
           '--platform', '$_PLATFORM']
  - name: gcr.io/cloud-builders/gcloud
    args: [ 'run', 'deploy', '$_BACKEND_SERVICE',
            '--image', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME-$COMMIT_SHA',
            '--region', '$_DEPLOY_REGION',
            '--platform', '$_PLATFORM' ]

As you could guess, we deploy two services in sequence. The _SCHEDULED_SERVICE is deployed first, then the _BACKEND_SERVICE. Both are cloud run services with 6 min instances each. Both are 16GB @ 4 vpu, also both share the same configs for startup and Liveness probes, which are:

initialDelaySeconds: 60
timeoutSeconds: 10
periodSeconds: 30
failureThreshold: 3
httpGet:
  path: /login
  port: 5000

Things were doing fine, until 2024 august 14th, when the last two steps began to take too long. Before, the last steps took 2 min each, then they started to take 8 to 7 minutes.

Checking the cloud logs it seems the instances are not being started in parallel as they were before august 14th. Cloud run appears to wait every instance to start before starting the next. The point is, we didn't change anything, but now the deploy is taking up to 15 min, from 4 min before.

I already checked the yaml on the revisons, there is no change at all. I also checked the change logs from the cloud run service on google website there were no updates on august 14th, hence I don't know if the date really matters.

Is anyone expecting the same issue? Do you have any suggestions I could use to improve the deploy time?


Solution

  • I really could not find what had happened with our project after August 14th 2024, however I was able to make it work by making significant changes to the build / deploy process. I'll post it here just in case someone (or myself in the future ;P) face something similar.

    Here's the cloudbuild.yaml we are using right now, with some comments:

    options:
      logging: CLOUD_LOGGING_ONLY
    steps:
      - name: gcr.io/cloud-builders/mvn
        args: [ '-e', '-q', 'package']
      - name: gcr.io/cloud-builders/docker
        args: ['build', '.',
               '-t','$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME:$COMMIT_SHA',
               '-f','./Docker/gcp/Dockerfile']
      - name: gcr.io/cloud-builders/docker
        args: [ 'push', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME:$COMMIT_SHA']
    #had to make this one because the next steps won't wait for
    #real service to start, so the process would pass even if wrong 
    #migrations/startup state was deployied. This will take 2 minutes or
    #more if there are long running migrations. This service is using
    #the startup probe I posted in the beginning (http /login), so 
    #it will wait the service to start before proceeding.
      - name: gcr.io/cloud-builders/gcloud
        id: InitialSetupAndMigrations
        args: [ 'run', 'deploy', '$_INITIAL_SETUP_MIGRATIONS',
                '--image', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME:$COMMIT_SHA',
                '--region', '$_DEPLOY_REGION',
                '--platform', '$_PLATFORM' ]
    #in the next two steps, I changed the startup probe by removing 
    #the previously posted (/login) and using the default one. The probe
    #basically tests if the container has started.
      - name: gcr.io/cloud-builders/gcloud
        waitFor:
          - InitialSetupAndMigrations
        args: ['run', 'deploy', '$_SCHEDULED_SERVICE',
               '--image', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME:$COMMIT_SHA',
               '--region', '$_DEPLOY_REGION',
               '--platform', '$_PLATFORM']
    #using --tag here is important, otherwise cloud run won't start
    #the containers. I also used no-traffic so the service would present
    #less errors during the startup of the new images.
      - name: gcr.io/cloud-builders/gcloud
        waitFor:
          - InitialSetupAndMigrations
        args: [ 'run', 'deploy', '$_SERVICE',
                '--image', '$_DEPLOY_REGION-docker.pkg.dev/$PROJECT_ID/$REPO_NAME/$REPO_NAME:$COMMIT_SHA',
                '--region', '$_DEPLOY_REGION',
                '--platform', '$_PLATFORM',
                '--tag', 'sha-$SHORT_SHA',
                '--no-traffic']
    #This is by-hand tunning, I know that the containers take close to
    #60secs to start.
      - name: gcr.io/cloud-builders/gcloud
        entrypoint: 'bash'
        args:
          - '-c'
          - |
            #!/usr/bin/env bash
            echo 'Waiting container startup'
            sleep 60
    #next two steps perform the traffic routing and remove the previoulsy
    #added tag.
      - name: gcr.io/cloud-builders/gcloud #setando trafego
        args: [ 'run', 'services', 'update-traffic', '$_SERVICE',
                '--region', '$_DEPLOY_REGION',
                '--to-tags','sha-$SHORT_SHA=100']
      - name: gcr.io/cloud-builders/gcloud #removendo traffic tag
        args: [ 'run', 'services', 'update-traffic', '$_SERVICE',
                '--region', '$_DEPLOY_REGION',
                '--remove-tags', 'sha-$SHORT_SHA' ]
    
    

    To make the instances start in parallel again I changed the startup probe on the services _SCHEDULED_SERVICE and _SERVICE. The new startup probe is:

    timeoutSeconds: 240
    periodSeconds: 240
    failureThreshold: 1
    tcpSocket:
      port: 5000
    

    This is actually the current "default" startup probe from cloud run, which is placed in case we create a service without a startup probe. What happens here is that the probe only tests if the container has started. When I was using the previously posted (http /login), since the service would take up to 1 min to start, cloud run would wait 1 minute before starting the next instance, thus taking N minutes if I have N instances on the service.

    By using the default probe, cloud run will start all containers in parallel, taking 1 minute to start the N containers. As I mentioned, this was done by default on our system when it suddenly changed.

    Anyway, it is working right now, thanks for all the help.