I'm using the following lines of code to specify the desired machine type and accelerator/GPU on a Kubeflow Pipelines (KFP) component that I will be running in a serverless manner through Vertex AI Pipelines:
(op()
 .set_cpu_limit('8')
 .set_memory_limit('50G')
 .add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')
 .set_gpu_limit(1))
This works for other GPUs as well, e.g. Pascal, Tesla, and Volta cards.
However, I can't do the same with the latest accelerator type, the Tesla A100, since it requires a special machine type, at least an a2-highgpu-1g.
How do I make sure that this particular component will run on an a2-highgpu-1g machine when I run it on Vertex?
If I simply follow the method used for the older GPUs:
(op()
 .set_cpu_limit('12')      # max for a2-highgpu-1g
 .set_memory_limit('85G')  # max for a2-highgpu-1g
 .add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-a100')
 .set_gpu_limit(1))
it throws an error when run/deployed, since the machine type being spawned is a general-purpose one, e.g. n1-highmem-*.
The same thing happens when I omit the CPU and memory limits, in the hope that Vertex will automatically select the right machine type based on the accelerator constraint:
(op()
 .add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-a100')
 .set_gpu_limit(1))
Error:
"NVIDIA_TESLA_A100" is not supported for machine type "n1-highmem-2",
Currently, GCP does not support the A2 machine type for normal KFP components. A workaround right now is to use the GCP custom job component, in which you can explicitly specify the machine type.
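A minimal sketch of that workaround, assuming the `google-cloud-pipeline-components` package is installed and that `train_op` is your existing KFP component (the component name and pipeline wiring here are illustrative, not from your code):

```python
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)

# Wrap the existing component so Vertex runs it as a CustomJob on an
# explicitly requested A2 machine, instead of the default n1-* pool.
train_on_a100_op = create_custom_training_job_from_component(
    train_op,                              # your original component
    machine_type='a2-highgpu-1g',          # required host type for the A100
    accelerator_type='NVIDIA_TESLA_A100',  # matches the enum in the error message
    accelerator_count=1,
)
```

Inside your pipeline definition you then call `train_on_a100_op(...)` in place of `train_op(...)`; the node-selector constraint and `set_gpu_limit` calls are no longer needed, since the machine and accelerator are specified on the CustomJob itself.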