I've been trying to add GPU support at build time (for compiling flash attention, for example) in Google Cloud Build, but I'm running into some issues. (Also discussed here: How do I attach a GPU to a Google Cloud build?)
The idea was to create a new builder by following the instructions in this repo https://github.com/GoogleCloudPlatform/cloud-builders-community/tree/master/remote-builder.
I started by building a GPU builder image:
steps:
- name: 'gcr.io/cloud-builders/docker'
  env:
  - ZONE=us-east5-b
  args: ['build', '-t', 'grc.io/project/builder/gpu-builder:latest', '.']
images:
- 'grc.io/project/builder/gpu-builder:latest'
options:
  machineType: 'E2_HIGHCPU_8'
which I built with the command:
gcloud builds submit --config=cloudbuild.yaml .
The image is pushed and everything works fine.
The goal was to use this builder and add the --accelerator argument to attach a GPU to the remote instance, as explained in the remote-builder repo, so I created the following config file:
steps:
- name: grc.io/project/builder/gpu-builder:latest
  waitFor: ["-"]
  env:
  - INSTANCE_ARGS=--accelerator=type=nvidia-a100-80gb,count=1
  - USERNAME=cloud-user
  - COMMAND=docker run -v /home/cloud-user/workspace:/workspace ubuntu:16.04 bash -xe /workspace/test-scripts/no-op.sh
options:
  machineType: 'E2_HIGHCPU_8'
PS: I've tried changing the machine type, but the change is not taken into account.
Whenever I change machineType (which is necessary because certain GPUs are only available in certain zones), the build always tries to use the 'n1-standard-1' machine type. This is the error:
- Invalid value for field 'resource.machineType': 'https://compute.googleapis.com/compute/v1/projects/operation-405916/zones/us-east5-b/machineTypes/n1-standard-1'. Machine type with name 'n1-standard-1' does not exist in zone 'us-east5-b'.
When I use a zone where n1-standard-1 is available, I get the following error instead:
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- [n1-standard-1, nvidia-a100-80gb] features are not compatible for creating instance.
(And of course, I verified that the A100 is available in the chosen zone.)
[UPDATE]
Putting the machine type in INSTANCE_ARGS, as suggested in the answer, gets past the machine type error; however, the build still fails:
+ gcloud compute instances create --accelerator=type=nvidia-a100-80gb,count=1 --machine-type a2-ultragpu-1g builder-9821e1c9-58f1-4829-badb-4cf5a3e69a1f --metadata block-project-ssh-keys=TRUE --metadata-from-file ssh-keys=ssh-keys
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
- Instances with guest accelerators do not support live migration.
ERROR
[UPDATE OF THE UPDATE]
Adding --maintenance-policy terminate to INSTANCE_ARGS might resolve the issue.
=> What might be the issue? I've tried changing the machine type with the --machine-type argument and also by adding options in the config file.
=> Is there any other way to use Cloud Build with a GPU?
You must specify your machine type in INSTANCE_ARGS, like this:
steps:
- name: grc.io/project/builder/gpu-builder:latest
  waitFor: ["-"]
  env:
  - INSTANCE_ARGS=--accelerator=type=nvidia-a100-80gb,count=1 --machine-type e2-highcpu-8
  - USERNAME=cloud-user
  - COMMAND=docker run -v /home/cloud-user/workspace:/workspace ubuntu:16.04 bash -xe /workspace/test-scripts/no-op.sh
The options section applies to the Cloud Build worker machine, not to the remote builder instance.
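Putting the pieces from the question's updates together, a config along these lines may work. This is only a sketch, not a tested setup: the a2-ultragpu-1g machine type and the maintenance policy flag are taken from the updates above, the policy is written in the gcloud form --maintenance-policy=TERMINATE, and the ZONE variable in this step is an assumption based on the remote-builder README (use a zone where the GPU is actually offered).

```yaml
steps:
- name: grc.io/project/builder/gpu-builder:latest
  waitFor: ["-"]
  env:
  # Flags forwarded to `gcloud compute instances create` for the remote VM.
  # GPU instances cannot live-migrate, hence the TERMINATE maintenance policy.
  - INSTANCE_ARGS=--accelerator=type=nvidia-a100-80gb,count=1 --machine-type a2-ultragpu-1g --maintenance-policy=TERMINATE
  # Assumption: the remote-builder script reads ZONE to decide where the VM is created.
  - ZONE=us-east5-b
  - USERNAME=cloud-user
  - COMMAND=docker run -v /home/cloud-user/workspace:/workspace ubuntu:16.04 bash -xe /workspace/test-scripts/no-op.sh
options:
  # Applies only to the Cloud Build worker that launches the remote VM.
  machineType: 'E2_HIGHCPU_8'
```

The worker only orchestrates the remote instance, so its machineType can stay small; everything GPU-related goes through INSTANCE_ARGS.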