Tags: azure, kubernetes, azure-aks, keda, keda-scaledjob

AKS KEDA: Pods are removed during execution


Pods are getting removed during execution, and the following error is reported:

##[error]We stopped hearing from agent matlab-aks-agent-6d69d8f7c5-7pt4n. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

I updated the YAML from a Deployment to a Job, as described in the official KEDA documentation: KEDA Documentation

apiVersion: batch/v1
kind: Job
metadata:

However, the issue persists.


Solution

  • As discussed in the comments, if your cluster shows high CPU and memory usage with overcommitted resources, this can lead to instability, especially with KEDA-managed jobs, where resources must be allocated dynamically. When node resources are insufficient to meet demand, overcommitment can cause the Kubernetes scheduler to evict or terminate pods. You can check your nodes' current allocation as shown below.
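
    To verify whether the nodes are actually overcommitted, a quick check with plain kubectl is enough (the node name is whatever `kubectl get nodes` returns in your cluster):

    # Show current CPU/memory usage per node (requires metrics-server)
    kubectl top nodes

    # Show the requests/limits already committed on a specific node
    kubectl describe node <your-node-name> | grep -A 8 "Allocated resources"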

    The KEDA documentation you shared suggests that a ScaledJob is ideal for long-running jobs, because it gives each job a fully isolated environment and automatically scales agents based on pending jobs in the Azure Pipelines queue.

    Following that logic, I would recommend updating your ScaledJob YAML accordingly once KEDA is deployed on your AKS cluster; a typical Helm-based install is sketched right below.
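
    If KEDA is not installed yet, the official Helm chart is the usual route (this assumes Helm is already configured against your AKS cluster; the namespace name is just a convention):

    # Add the official KEDA chart repo and install KEDA into its own namespace
    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
    helm install keda kedacore/keda --namespace keda --create-namespace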


    Modify the Dockerfile so the agent image supports long-running jobs and picks up its configuration from environment variables.

    FROM mcr.microsoft.com/dotnet/runtime:6.0
    
    # curl is needed to download the agent package; jq is useful for scripting against the pool
    RUN apt-get update && \
        apt-get install -y curl jq
    
    WORKDIR /azp
    
    # Pin the Azure Pipelines agent version; bump as needed
    ENV AZP_AGENT_PACKAGE=https://vstsagentpackage.azureedge.net/agent/2.206.1/vsts-agent-linux-x64-2.206.1.tar.gz
    
    RUN curl -LsS ${AZP_AGENT_PACKAGE} | tar -xz
    
    # The agent scripts refuse to run as root unless this is set; alternatively, create a non-root user
    ENV AGENT_ALLOW_RUNASROOT=1
    
    # Configure the agent against AZP_URL/AZP_TOKEN/AZP_POOL (injected by the ScaledJob), then start it
    CMD [ "/bin/bash", "-c", "/azp/config.sh --unattended --url \"$AZP_URL\" --auth pat --token \"$AZP_TOKEN\" --pool \"$AZP_POOL\" --agent \"$(hostname)\" --acceptTeeEula && /azp/run.sh" ]
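
    Once built, push the image to the registry referenced in the ScaledJob below (the registry and repository names here are taken from that YAML, so adjust them to your own setup):

    # Build the agent image and push it to ACR
    az acr login --name arkoacr
    docker build -t arkoacr.azurecr.io/azure-devops-agent:latest .
    docker push arkoacr.azurecr.io/azure-devops-agent:latest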
    
    


    Adjust the ScaledJob settings to optimize for long-running jobs and avoid agents being scaled down mid-run.

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: azure-devops-scaledjob
      namespace: default
    spec:
      jobTargetRef:
        template:
          spec:
            containers:
            - name: azdevops-agent
              image: arkoacr.azurecr.io/azure-devops-agent:latest
              env:
                - name: AZP_URL
                  value: "https://dev.azure.com/arko"
                - name: AZP_POOL
                  value: "arkodemo"
                - name: AZP_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: pipeline-auth
                      key: personalAccessToken
              resources:
                requests:
                  cpu: "500m"       # Here you adjust these resource values based on actual needs
                  memory: "1Gi"     # same logic for memory
                limits:
                  cpu: "1000m"      # Set a reasonable upper limit, such as 1 vCPU
                  memory: "2Gi"     # Set an upper memory limit to prevent high resource usage
            restartPolicy: Never
      pollingInterval: 30              # Check the Azure Pipelines queue every 30 seconds
      successfulJobsHistoryLimit: 5    # Keep the last 5 completed jobs for inspection
      failedJobsHistoryLimit: 5        # Same for failed jobs
      maxReplicaCount: 5               # Never run more than 5 agents in parallel
      scalingStrategy:
        strategy: "default"
      triggers:
      - type: azure-pipelines
        metadata:
          poolID: "12"
          organizationURLFromEnv: "AZP_URL"
          personalAccessTokenFromEnv: "AZP_TOKEN"
    
    


    This updated ScaledJob configuration should help prevent overcommitting resources and ensure that each agent pod gets a predictable amount of resources. You can apply it and watch the scaling as shown below.
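
    Apply the manifest and watch KEDA create jobs as pipeline runs queue up (plain kubectl; the file name is just an example):

    # Deploy the ScaledJob and monitor the agents it spawns
    kubectl apply -f azure-devops-scaledjob.yaml
    kubectl get scaledjobs,jobs,pods -n default -w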


    I already mentioned two other approaches in the comments. One more option is to set a ResourceQuota on your cluster at the namespace level.

    Something like this:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: resource-quota
      namespace: default
    spec:
      hard:
        requests.cpu: "10000m"      # Total CPU requests allowed in the namespace (10 vCPU)
        requests.memory: "20Gi"     # Total memory requests allowed
        limits.cpu: "15000m"        # Total CPU limits allowed (15 vCPU)
        limits.memory: "30Gi"       # Total memory limits allowed
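
    Apply it and confirm the caps the namespace is now held to (again, standard kubectl; the file name is an example):

    # Create the quota and compare current usage against the hard caps
    kubectl apply -f resource-quota.yaml
    kubectl describe resourcequota resource-quota -n default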
    

    Also, I found this related question on SO that may help.