Tags: azure, kubernetes, azure-aks, keda, keda-scaledjob

AKS KEDA: Pods are removed during execution


Pods are getting removed during execution, and the following error is reported:

##[error]We stopped hearing from agent matlab-aks-agent-6d69d8f7c5-7pt4n. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610

I updated the YAML from a Deployment to a Job, as described in the official KEDA documentation: KEDA Documentation

apiVersion: batch/v1
kind: Job
metadata:

However, the issue persists.


Solution

  • As discussed in the comments, if your cluster shows high CPU and memory usage with overcommitted resources, this can lead to instability, especially with KEDA-managed jobs, where resources must be allocated dynamically. When node resources are insufficient to meet demand, overcommitment can cause the Kubernetes scheduler to evict or terminate pods. You can check your nodes' current allocation as shown below.
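
    To verify whether the nodes are actually overcommitted, a quick check with plain kubectl is enough (the node name is whatever `kubectl get nodes` returns in your cluster):

    # Show current CPU/memory usage per node (requires metrics-server)
    kubectl top nodes

    # Show the requests/limits already committed on a specific node
    kubectl describe node <your-node-name> | grep -A 8 "Allocated resources"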

    The KEDA documentation you shared suggests that a ScaledJob is ideal for long-running jobs, because it gives each job a fully isolated environment and automatically scales agents based on pending jobs in the Azure Pipelines queue.

    Following that logic, I would recommend updating your ScaledJob YAML accordingly once KEDA is deployed on your AKS cluster; a typical Helm-based install is sketched right below.
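
    If KEDA is not installed yet, the official Helm chart is the usual route (this assumes Helm is already configured against your AKS cluster; the namespace name is just a convention):

    # Add the official KEDA chart repo and install KEDA into its own namespace
    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
    helm install keda kedacore/keda --namespace keda --create-namespace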


    Modify the Dockerfile so the agent image supports long-running jobs and picks up its configuration from environment variables.

    FROM mcr.microsoft.com/dotnet/runtime:6.0
    
    # curl is needed to download the agent package; jq is useful for scripting against the pool
    RUN apt-get update && \
        apt-get install -y curl jq
    
    WORKDIR /azp
    
    # Pin the Azure Pipelines agent version; bump as needed
    ENV AZP_AGENT_PACKAGE=https://vstsagentpackage.azureedge.net/agent/2.206.1/vsts-agent-linux-x64-2.206.1.tar.gz
    
    RUN curl -LsS ${AZP_AGENT_PACKAGE} | tar -xz
    
    # The agent scripts refuse to run as root unless this is set; alternatively, create a non-root user
    ENV AGENT_ALLOW_RUNASROOT=1
    
    # Configure the agent against AZP_URL/AZP_TOKEN/AZP_POOL (injected by the ScaledJob), then start it
    CMD [ "/bin/bash", "-c", "/azp/config.sh --unattended --url \"$AZP_URL\" --auth pat --token \"$AZP_TOKEN\" --pool \"$AZP_POOL\" --agent \"$(hostname)\" --acceptTeeEula && /azp/run.sh" ]
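
    Once built, push the image to the registry referenced in the ScaledJob below (the registry and repository names here are taken from that YAML, so adjust them to your own setup):

    # Build the agent image and push it to ACR
    az acr login --name arkoacr
    docker build -t arkoacr.azurecr.io/azure-devops-agent:latest .
    docker push arkoacr.azurecr.io/azure-devops-agent:latest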
    
    


    Adjust the ScaledJob settings to optimize for long-running jobs and avoid agents being scaled down mid-run.

    apiVersion: keda.sh/v1alpha1
    kind: ScaledJob
    metadata:
      name: azure-devops-scaledjob
      namespace: default
    spec:
      jobTargetRef:
        template:
          spec:
            containers:
            - name: azdevops-agent
              image: arkoacr.azurecr.io/azure-devops-agent:latest
              env:
                - name: AZP_URL
                  value: "https://dev.azure.com/arko"
                - name: AZP_POOL
                  value: "arkodemo"
                - name: AZP_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: pipeline-auth
                      key: personalAccessToken
              resources:
                requests:
                  cpu: "500m"       # Here you adjust these resource values based on actual needs
                  memory: "1Gi"     # same logic for memory
                limits:
                  cpu: "1000m"      # Set a reasonable upper limit, such as 1 vCPU
                  memory: "2Gi"     # Set an upper memory limit to prevent high resource usage
            restartPolicy: Never
      pollingInterval: 30              # Check the Azure Pipelines queue every 30 seconds
      successfulJobsHistoryLimit: 5    # Keep the last 5 completed jobs for inspection
      failedJobsHistoryLimit: 5        # Same for failed jobs
      maxReplicaCount: 5               # Never run more than 5 agents in parallel
      scalingStrategy:
        strategy: "default"
      triggers:
      - type: azure-pipelines
        metadata:
          poolID: "12"
          organizationURLFromEnv: "AZP_URL"
          personalAccessTokenFromEnv: "AZP_TOKEN"
    
    


    This updated ScaledJob configuration should help prevent overcommitting resources and ensure that each agent pod gets a predictable amount of resources. You can apply it and watch the scaling as shown below.
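
    Apply the manifest and watch KEDA create jobs as pipeline runs queue up (plain kubectl; the file name is just an example):

    # Deploy the ScaledJob and monitor the agents it spawns
    kubectl apply -f azure-devops-scaledjob.yaml
    kubectl get scaledjobs,jobs,pods -n default -w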


    I already mentioned two other approaches in the comments. One more option is to set a ResourceQuota on your cluster at the namespace level.

    Something like this:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: resource-quota
      namespace: default
    spec:
      hard:
        requests.cpu: "10000m"      # Total CPU requests allowed in the namespace (10 vCPU)
        requests.memory: "20Gi"     # Total memory requests allowed
        limits.cpu: "15000m"        # Total CPU limits allowed (15 vCPU)
        limits.memory: "30Gi"       # Total memory limits allowed
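
    Apply it and confirm the caps the namespace is now held to (again, standard kubectl; the file name is an example):

    # Create the quota and compare current usage against the hard caps
    kubectl apply -f resource-quota.yaml
    kubectl describe resourcequota resource-quota -n default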
    

    Also, I found this related question on SO that may help.