I'm using the following lines of code to specify the desired machine type and accelerator/GPU on a Kubeflow Pipelines (KFP) component that I will be running in a serverless manner through Vertex AI Pipelines:
(op()
 .set_cpu_limit('8')
 .set_memory_limit('50G')
 .add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')
 .set_gpu_limit(1))
This works for other GPUs as well, e.g. Pascal, Tesla, and Volta cards.
However, I can't do the same with the latest accelerator type, the Tesla A100, since it requires a special machine type, at least an a2-highgpu-1g.
How do I make sure that this particular component will run on an a2-highgpu-1g machine when I run it on Vertex?
If I simply follow the method used for the older GPUs:
(op()
 .set_cpu_limit('12')      # max for a2-highgpu-1g
 .set_memory_limit('85G')  # max for a2-highgpu-1g
 .add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-a100')
 .set_gpu_limit(1))
it throws an error when run/deployed, since the machine type being spawned is a general-purpose one, e.g. n1-highmem-*.
The same thing happens when I omit the CPU and memory limits, in the hope that Vertex will automatically select the right machine type based on the accelerator constraint:
(op()
 .add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-a100')
 .set_gpu_limit(1))
Error:
"NVIDIA_TESLA_A100" is not supported for machine type "n1-highmem-2",
Currently, GCP does not support the A2 machine type for normal KFP components. A workaround right now is to use the GCP custom job component, in which you can explicitly specify the machine type.
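A minimal sketch of that workaround, assuming the `google-cloud-pipeline-components` package is installed and that `train_op` is your existing KFP component (the component name and pipeline wiring here are illustrative, not from your code):

```python
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)

# Wrap the existing component so Vertex runs it as a CustomJob on an
# explicitly requested A2 machine, instead of the default n1-* pool.
train_on_a100_op = create_custom_training_job_from_component(
    train_op,                              # your original component
    machine_type='a2-highgpu-1g',          # required host type for the A100
    accelerator_type='NVIDIA_TESLA_A100',  # matches the enum in the error message
    accelerator_count=1,
)
```

Inside your pipeline definition you then call `train_on_a100_op(...)` in place of `train_op(...)`; the node-selector constraint and `set_gpu_limit` calls are no longer needed, since the machine and accelerator are specified on the CustomJob itself.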