Tags: azure, kubernetes, azure-aks, autoscaling

Scale up preferred node pool in Azure Kubernetes Cluster


In an Azure Kubernetes Service (AKS) cluster there are two autoscaling node pools with GPU nodes. One of them has Azure Spot instances enabled. Whenever possible, workloads should be deployed to nodes in the spot pool, and that pool should be scaled up accordingly.

These are the labels and taints of the two pools:

gpuscale1
Labels: sku=gpu, [...]
Taints: sku=gpu:NoSchedule
gpuspot1
Labels: sku=gpu, kubernetes.azure.com/scalesetpriority=spot, [...]
Taints: sku=gpu:NoSchedule, kubernetes.azure.com/scalesetpriority=spot:NoSchedule

Now, when scheduling a pod with the following tolerations and affinity, the autoscaler scales up gpuscale1 instead of gpuspot1.

  tolerations:
  - key: sku
    value: gpu
    effect: NoSchedule
  - key: kubernetes.azure.com/scalesetpriority
    value: spot
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: kubernetes.azure.com/scalesetpriority
            operator: In
            values:
            - spot   

Ideally, the auto-scaler would first scale up gpuspot1 and only use gpuscale1 if no more spot nodes are available. How can this scenario be achieved? Is there an error in my configuration?


Solution

  • Yes, adding a cluster-autoscaler-priority-expander ConfigMap to the cluster is one way (a sketch of such a ConfigMap is included at the end of this answer). The affinity-based approach suggested here is another: to have AKS preferentially scale and schedule workloads on the spot node pool (gpuspot1), adjust your deployment configuration so that it favors those nodes more strongly. First, ensure your node pools are set up in AKS with the appropriate labels and taints:

    • gpuscale1 (On-demand GPU Nodes)

      • Labels: sku=gpu
      • Taints: sku=gpu:NoSchedule
    • gpuspot1 (Spot GPU Nodes)

      • Labels: sku=gpu, kubernetes.azure.com/scalesetpriority=spot
      • Taints: sku=gpu:NoSchedule, kubernetes.azure.com/scalesetpriority=spot:NoSchedule

    Modify your deployment to include a stronger node affinity toward spot instances and the appropriate tolerations, for example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-enabled-app
      labels:
        app: gpu-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: gpu-app
      template:
        metadata:
          labels:
            app: gpu-app
        spec:
          containers:
          - name: cuda-container
            image: nvidia/cuda:11.0-base
            resources:
              limits:
                nvidia.com/gpu: 1
          tolerations:
          - key: "sku"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
          - key: "kubernetes.azure.com/scalesetpriority"
            operator: "Equal"
            value: "spot"
            effect: "NoSchedule"
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: "sku"
                    operator: "In"
                    values:
                    - "gpu"
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                preference:
                  matchExpressions:
                  - key: "kubernetes.azure.com/scalesetpriority"
                    operator: "In"
                    values:
                    - "spot"
    

    Here, the tolerations ensure that the pods can be scheduled onto nodes carrying the specified taints. The preferredDuringSchedulingIgnoredDuringExecution affinity (weight 100) prefers scheduling on spot instances but does not strictly require it, and the requiredDuringSchedulingIgnoredDuringExecution affinity ensures that the pods are scheduled only on nodes labeled sku=gpu.
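
    To verify the behavior after a scale-up, you can check which nodes the pods landed on and whether the spot pool grew; the selectors below reuse the app=gpu-app label and the spot label from above:

    kubectl get pods -l app=gpu-app -o wide                            # shows the node each pod was scheduled on
    kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot    # lists only the spot nodes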


    As discussed in the comments, this setup ensures your Kubernetes deployments leverage the intended GPU resources by using node affinity to direct the workload specifically to nodes labeled sku=gpu.
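
    For completeness, here is a minimal sketch of the cluster-autoscaler-priority-expander ConfigMap mentioned at the beginning. The regex patterns are assumptions and must match the autoscaler's node group (VM scale set) names for your pools, so adjust them to your environment; a higher number means a higher priority, so the autoscaler would try gpuspot1 before falling back to gpuscale1:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-autoscaler-priority-expander
      namespace: kube-system
    data:
      priorities: |-
        # higher value = higher priority; the entries are regexes matched
        # against node group (VMSS) names -- adjust to your actual pool names
        50:
          - .*gpuspot1.*
        10:
          - .*gpuscale1.*

    Note that this ConfigMap only has an effect when the cluster autoscaler is configured to use the priority expander; on AKS that is set through the cluster autoscaler profile (for example, az aks update --cluster-autoscaler-profile expander=priority), so check the current AKS documentation for the exact syntax.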