Tags: azure, kubernetes, azure-aks, autoscaling

Scale up preferred node pool in Azure Kubernetes Cluster


In an Azure Kubernetes Service (AKS) cluster there are two autoscaling node pools with GPU nodes. One of them has Azure Spot instances enabled. Whenever possible, workloads should be deployed to nodes in the spot pool, and that pool should be scaled up accordingly.

These are the labels and taints of the two pools:

gpuscale1
Labels: sku=gpu, [...]
Taints: sku=gpu:NoSchedule
gpuspot1
Labels: sku=gpu, kubernetes.azure.com/scalesetpriority=spot, [...]
Taints: sku=gpu:NoSchedule, kubernetes.azure.com/scalesetpriority=spot:NoSchedule

Now, when scheduling a pod with the following tolerations and affinity, the autoscaler scales up gpuscale1 instead of gpuspot1.

  tolerations:
  - key: sku
    value: gpu
    effect: NoSchedule
  - key: kubernetes.azure.com/scalesetpriority
    value: spot
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: kubernetes.azure.com/scalesetpriority
            operator: In
            values:
            - spot   

Ideally, the auto-scaler would first scale up gpuspot1 and only use gpuscale1 if no more spot nodes are available. How can this scenario be achieved? Is there an error in my configuration?


Solution

  • Yes, adding a cluster-autoscaler-priority-expander ConfigMap to the cluster is one way (a sketch of such a ConfigMap is included at the end of this answer). The affinity-based approach suggested here is another: to have AKS preferentially scale and schedule workloads on the spot node pool (gpuspot1), adjust your deployment configuration so that it favors those nodes more strongly. First, ensure your node pools are set up in AKS with the appropriate labels and taints:

    • gpuscale1 (On-demand GPU Nodes)

      • Labels: sku=gpu
      • Taints: sku=gpu:NoSchedule
    • gpuspot1 (Spot GPU Nodes)

      • Labels: sku=gpu, kubernetes.azure.com/scalesetpriority=spot
      • Taints: sku=gpu:NoSchedule, kubernetes.azure.com/scalesetpriority=spot:NoSchedule

    Modify your deployment to include a stronger node affinity toward spot instances and the appropriate tolerations, for example:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gpu-enabled-app
      labels:
        app: gpu-app
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: gpu-app
      template:
        metadata:
          labels:
            app: gpu-app
        spec:
          containers:
          - name: cuda-container
            image: nvidia/cuda:11.0-base
            resources:
              limits:
                nvidia.com/gpu: 1
          tolerations:
          - key: "sku"
            operator: "Equal"
            value: "gpu"
            effect: "NoSchedule"
          - key: "kubernetes.azure.com/scalesetpriority"
            operator: "Equal"
            value: "spot"
            effect: "NoSchedule"
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: "sku"
                    operator: "In"
                    values:
                    - "gpu"
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                preference:
                  matchExpressions:
                  - key: "kubernetes.azure.com/scalesetpriority"
                    operator: "In"
                    values:
                    - "spot"
    

    Here, the tolerations ensure that the pods can be scheduled onto nodes carrying the specified taints. The preferredDuringSchedulingIgnoredDuringExecution affinity (weight 100) prefers scheduling on spot instances but does not strictly require it, and the requiredDuringSchedulingIgnoredDuringExecution affinity ensures that the pods are scheduled only on nodes labeled sku=gpu.
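
    To verify the behavior after a scale-up, you can check which nodes the pods landed on and whether the spot pool grew; the selectors below reuse the app=gpu-app label and the spot label from above:

    kubectl get pods -l app=gpu-app -o wide                            # shows the node each pod was scheduled on
    kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot    # lists only the spot nodes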


    As discussed in the comments, this setup ensures your Kubernetes deployments leverage the intended GPU resources by using node affinity to direct the workload specifically to nodes labeled sku=gpu.
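
    For completeness, here is a minimal sketch of the cluster-autoscaler-priority-expander ConfigMap mentioned at the beginning. The regex patterns are assumptions and must match the autoscaler's node group (VM scale set) names for your pools, so adjust them to your environment; a higher number means a higher priority, so the autoscaler would try gpuspot1 before falling back to gpuscale1:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-autoscaler-priority-expander
      namespace: kube-system
    data:
      priorities: |-
        # higher value = higher priority; the entries are regexes matched
        # against node group (VMSS) names -- adjust to your actual pool names
        50:
          - .*gpuspot1.*
        10:
          - .*gpuscale1.*

    Note that this ConfigMap only has an effect when the cluster autoscaler is configured to use the priority expander; on AKS that is set through the cluster autoscaler profile (for example, az aks update --cluster-autoscaler-profile expander=priority), so check the current AKS documentation for the exact syntax.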