In my Azure Kubernetes Service (AKS) cluster there are two autoscaling node pools with GPU nodes. One of them has Azure Spot instances enabled. When possible, workloads should be deployed to nodes in the spot pool and that pool scaled up accordingly.
These are the labels and taints of the two pools:
gpuscale1
Labels: sku=gpu, [...]
Taints: sku=gpu:NoSchedule
gpuspot1
Labels: sku=gpu, kubernetes.azure.com/scalesetpriority=spot, [...]
Taints: sku=gpu:NoSchedule, kubernetes.azure.com/scalesetpriority=spot:NoSchedule
Now, when scheduling a pod with the following tolerations and affinity, the autoscaler scales up gpuscale1 instead of gpuspot1.
tolerations:
- key: sku
  value: gpu
  effect: NoSchedule
- key: kubernetes.azure.com/scalesetpriority
  value: spot
  effect: NoSchedule
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
        - key: kubernetes.azure.com/scalesetpriority
          operator: In
          values:
          - spot
Ideally, the autoscaler would first scale up gpuspot1 and only fall back to gpuscale1 once no more spot nodes are available. How can this be achieved? Is there an error in my configuration?
Yes, adding a cluster-autoscaler-priority-expander ConfigMap to the cluster is one way; a minimal sketch follows.
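This is only a sketch, assuming the cluster autoscaler's expander is set to priority (on AKS this is part of the cluster autoscaler profile) and that the underlying node group names contain the pool names, so they can be matched by regex. Higher numbers are scaled up first:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
# Assumption: node group names include the pool names (gpuspot1 / gpuscale1),
# so the regexes below match them. Higher priority values win.
data:
  priorities: |-
    50:
      - .*gpuspot1.*
    10:
      - .*gpuscale1.*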
What I was suggesting, though, is the affinity approach: to have AKS preferentially scale and schedule workloads onto the spot node pool (gpuspot1), adjust your deployment configuration to more strongly favor those nodes. First, ensure your node pools are set up in AKS with the appropriate labels and taints:
gpuscale1 (on-demand GPU nodes)
Labels: sku=gpu
Taints: sku=gpu:NoSchedule
gpuspot1 (spot GPU nodes)
Labels: sku=gpu, kubernetes.azure.com/scalesetpriority=spot
Taints: sku=gpu:NoSchedule, kubernetes.azure.com/scalesetpriority=spot:NoSchedule
Next, modify your deployment to include a stronger node affinity toward spot instances and the appropriate tolerations. For example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-enabled-app
  labels:
    app: gpu-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: gpu-app
  template:
    metadata:
      labels:
        app: gpu-app
    spec:
      containers:
      - name: cuda-container
        image: nvidia/cuda:11.0-base
        resources:
          limits:
            nvidia.com/gpu: 1
      # Tolerate the taints of both pools so the pods can land on either one.
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          # Hard requirement: only schedule onto GPU-labelled nodes.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: "sku"
                operator: "In"
                values:
                - "gpu"
          # Soft preference with a high weight: favor spot nodes when available.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: "kubernetes.azure.com/scalesetpriority"
                operator: "In"
                values:
                - "spot"
Here, the tolerations ensure that the pods can be scheduled on nodes carrying the specified taints. The preferredDuringSchedulingIgnoredDuringExecution affinity prefers scheduling on spot instances but does not strictly require it, while the requiredDuringSchedulingIgnoredDuringExecution affinity ensures that the pods are only scheduled on nodes labeled with sku=gpu.
As discussed in the comments, this setup makes sure your Kubernetes deployments use the intended GPU resources by directing the workload, via node affinity, specifically to nodes labeled sku=gpu.