Tags: kubernetes, azure-devops, azure-aks, keda, keda-scaledjob

AKS with KEDA: pods are removed during execution


I tried KEDA with AKS and I really appreciate how pods are automatically instantiated based on the Azure DevOps job queue for releases and builds.

However, I noticed something strange: AKS/KEDA often removes a pod while it is still processing, which makes the workflow fail.

The message reads: "We stopped hearing from agent aks-linux-768d6647cc-ntmh4. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610"

Expected behavior: a pod must complete its task before KEDA/AKS removes it.

Here is my deployment YAML file:

# deployment.yaml
apiVersion: apps/v1 # The API resource where this workload resides
kind: Deployment # The kind of workload we're creating
metadata:
  name: aks-linux # This will be the name of the deployment
spec:
  selector: # Define the wrapping strategy
    matchLabels: # Match all pods with the defined labels
      app: aks-linux # Labels follow the `name: value` template
  replicas: 1
  template: # This is the template of the pod inside the deployment
    metadata: # Metadata for the pod
      labels:
        app: aks-linux
    spec:
      nodeSelector:
        agentpool: linux
      containers: # Here we define all containers
        - image: <My image here>
          name: aks-linux
          env:
            - name: "AZP_URL"
              value: "<myURL>"
            - name: "AZP_TOKEN"
              value: "<MyToken>"
            - name: "AZP_POOL"
              value: "<MyPool>"
          resources:
            requests: # Minimum amount of resources requested
              cpu: 2
              memory: 4096Mi
            limits: # Maximum amount of resources the container is allowed to use
              cpu: 4
              memory: 8192Mi

I am using the latest versions of AKS and KEDA. Any ideas?


Solution

  • Check the official KEDA docs:

    When running your agents as a deployment you have no control on which pod gets killed when scaling down.

    So, to solve this you need to use a ScaledJob:

    If you run your agents as a Job, KEDA will start a Kubernetes job for each job that is in the agent pool queue. The agents will accept one job when they are started and terminate afterwards. Since an agent is always created for every pipeline job, you can achieve fully isolated build environments by using Kubernetes jobs.

    See the KEDA documentation for how to implement it; a minimal ScaledJob sketch follows below.
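For reference, here is a minimal ScaledJob sketch following the pattern described in the KEDA Azure Pipelines scaler docs. The image, URL, token, and pool values are the placeholders from the question, and the polling/history/replica settings are illustrative assumptions, not recommendations. The agent container's entrypoint should also be set up to accept a single job and then exit (for example with the agent's --once flag) so each Kubernetes Job completes cleanly.

# scaledjob.yaml -- illustrative sketch; values in <> are placeholders
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: aks-linux-scaledjob
spec:
  triggers:
    - type: azure-pipelines
      metadata:
        poolName: "<MyPool>"
        organizationURLFromEnv: "AZP_URL"       # env var defined on the agent container below
        personalAccessTokenFromEnv: "AZP_TOKEN" # env var defined on the agent container below
  jobTargetRef:
    template: # Pod template for each one-shot agent Job
      spec:
        nodeSelector:
          agentpool: linux
        containers:
          - name: aks-linux
            image: <My image here>
            env:
              - name: "AZP_URL"
                value: "<myURL>"
              - name: "AZP_TOKEN"
                value: "<MyToken>"
              - name: "AZP_POOL"
                value: "<MyPool>"
            resources:
              requests:
                cpu: 2
                memory: 4096Mi
              limits:
                cpu: 4
                memory: 8192Mi
        restartPolicy: Never # each agent run is a one-shot Job, not a long-lived pod
  pollingInterval: 30             # assumption: check the agent pool queue every 30 seconds
  successfulJobsHistoryLimit: 5   # assumption: keep the last 5 completed jobs
  failedJobsHistoryLimit: 5
  maxReplicaCount: 10             # assumption: at most 10 concurrent agent jobs

With this in place, KEDA starts one Kubernetes Job per queued pipeline job; the agent picks up exactly one job and terminates when it finishes, so scaling down never kills a pod that is still running a build or release.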