
Google Cloud Build: adding a GPU at build time


I've been trying to add GPU support at build time (for compiling flash attention, for example) in Google Cloud Build, but I've run into some issues. (Also discussed here: How do I attach a GPU to a Google Cloud build?)

Explanations

The idea was to create a new builder by following the instructions in this repo: https://github.com/GoogleCloudPlatform/cloud-builders-community/tree/master/remote-builder.

I started by building a GPU builder with this config:

steps:
- name: 'gcr.io/cloud-builders/docker'
  env: 
  - ZONE=us-east5-b
  args: ['build', '-t', 'grc.io/project/builder/gpu-builder:latest', '.']
images:
- 'grc.io/project/builder/gpu-builder:latest'
options:
  machineType: 'E2_HIGHCPU_8'

which I built using the command:

gcloud builds submit --config=cloudbuild.yaml .

The image is pushed and everything works fine.

The goal was to use this builder and add the --accelerator argument to attach a GPU to the remote instance, as explained in the remote-builder repo, so I created the following config file:

steps: 
- name: grc.io/project/builder/gpu-builder:latest
  waitFor: ["-"]
  env: 
    - INSTANCE_ARGS=--accelerator=type=nvidia-a100-80gb,count=1
    - USERNAME=cloud-user
    - COMMAND=docker run -v /home/cloud-user/workspace:/workspace ubuntu:16.04 bash -xe /workspace/test-scripts/no-op.sh
options:
  machineType: 'E2_HIGHCPU_8'

PS: I've tried changing the machine type in the options section, but the change is not taken into account.

The problem

Whenever I try to change the machineType (certain GPUs are only available in certain zones, with certain machine types), the builder still tries to create the instance with the 'n1-standard-1' machine type. This is the error:

 - Invalid value for field 'resource.machineType': 'https://compute.googleapis.com/compute/v1/projects/operation-405916/zones/us-east5-b/machineTypes/n1-standard-1'. Machine type with name 'n1-standard-1' does not exist in zone 'us-east5-b'.

When I use a zone where n1-standard-1 is available, I get the following error instead:

ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - [n1-standard-1, nvidia-a100-80gb] features are not compatible for creating instance.

(And of course, I verified that the A100 is available in the chosen zone.)
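For reference, a check like that can be scripted. This is only a minimal sketch: it assumes gcloud is installed and authenticated, check_zone is a hypothetical helper name, and the a2-ultragpu prefix is an assumption taken from the machine type that appears in the update further down.

```shell
# Hypothetical pre-flight check: does the zone offer the GPU and a matching machine family?
ZONE="${ZONE:-us-east5-b}"
GPU_TYPE="nvidia-a100-80gb"
MACHINE_PREFIX="a2-ultragpu"   # assumption: family that pairs with this GPU

check_zone() {
  # Accelerator types offered in the zone, narrowed to the exact GPU name.
  gcloud compute accelerator-types list \
    --filter="zone:( ${ZONE} )" --format="value(name)" | grep -x "${GPU_TYPE}"
  # Machine types in the same zone whose name starts with the prefix.
  gcloud compute machine-types list \
    --filter="zone:( ${ZONE} )" --format="value(name)" | grep "^${MACHINE_PREFIX}"
}

# Only call gcloud when it is actually available on this machine.
if command -v gcloud >/dev/null 2>&1; then
  check_zone
else
  echo "gcloud not found; skipping check for ${GPU_TYPE} in ${ZONE}"
fi
```

If either grep prints nothing, the GPU or the machine family is probably not offered in that zone, and instance creation would be expected to fail there.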

[UPDATE] Putting the machine type in INSTANCE_ARGS, as the answer suggests, fixes the machine type error. However, instance creation then fails with a different error:

+ gcloud compute instances create --accelerator=type=nvidia-a100-80gb,count=1 --machine-type a2-ultragpu-1g builder-9821e1c9-58f1-4829-badb-4cf5a3e69a1f --metadata block-project-ssh-keys=TRUE --metadata-from-file ssh-keys=ssh-keys
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - Instances with guest accelerators do not support live migration.
ERROR

[UPDATE OF THE UPDATE]

Adding --maintenance-policy TERMINATE to INSTANCE_ARGS might resolve the issue, since instances with GPUs cannot be live migrated.

Questions

=> What might be the issue? I've tried changing the machine type both with the --machine-type argument and by adding options in the config file.

=> Is there any other way to use Cloud Build with a GPU?


Solution

  • You must specify the machine type in INSTANCE_ARGS, like this:

    steps: 
    - name: grc.io/project/builder/gpu-builder:latest
      waitFor: ["-"]
      env: 
        - INSTANCE_ARGS=--accelerator=type=nvidia-a100-80gb,count=1 --machine-type a2-ultragpu-1g --maintenance-policy TERMINATE
        - USERNAME=cloud-user
        - COMMAND=docker run -v /home/cloud-user/workspace:/workspace ubuntu:16.04 bash -xe /workspace/test-scripts/no-op.sh
    

    The options section configures the Cloud Build worker machine itself, not the remote instance that the remote builder creates, so machineType there has no effect on the GPU instance.
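To make the role of INSTANCE_ARGS more concrete, here is a rough dry-run sketch. It only prints a command shaped like the one in the question's log and does not call gcloud. The names build_create_command, lint_instance_args and builder-example are hypothetical, and the real remote-builder script may assemble its command differently.

```shell
# Dry run: print (do not execute) a command shaped like the one in the question's log.
INSTANCE_ARGS="--accelerator=type=nvidia-a100-80gb,count=1 --machine-type a2-ultragpu-1g --maintenance-policy TERMINATE"
INSTANCE_NAME="builder-example"   # hypothetical instance name

build_create_command() {
  # INSTANCE_ARGS is spliced in as-is, so every flag for the remote VM must live there.
  echo "gcloud compute instances create ${INSTANCE_ARGS} ${INSTANCE_NAME}"
}

lint_instance_args() {
  # Warn when a GPU is requested without the flags the errors in the question point to.
  local args="$1"
  case "$args" in
    *--accelerator*) ;;
    *) return 0 ;;
  esac
  case "$args" in
    *--machine-type*) ;;
    *) echo "warning: --accelerator without --machine-type" ;;
  esac
  case "$args" in
    *--maintenance-policy*) ;;
    *) echo "warning: --accelerator without --maintenance-policy" ;;
  esac
}

CREATE_COMMAND="$(build_create_command)"
echo "${CREATE_COMMAND}"
lint_instance_args "${INSTANCE_ARGS}"
```

Nothing from the options section of cloudbuild.yaml appears in the printed command, which is consistent with machineType being ignored for the remote instance.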