Pods are being removed during execution, and the pipeline fails with the error below:
##[error]We stopped hearing from agent matlab-aks-agent-6d69d8f7c5-7pt4n. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error. For more information, see: https://go.microsoft.com/fwlink/?linkid=846610
The YAML was updated from a Deployment to a Job, as described in the official KEDA documentation: KEDA Documentation
apiVersion: batch/v1
kind: Job
metadata:
However, the issue persists.
As discussed in the comments, if your cluster shows high CPU and memory usage with overcommitted resources, that can lead to instability, especially with KEDA-managed jobs, where resources are allocated dynamically. When node resources are insufficient to meet demand, Kubernetes can evict pods or OOM-kill containers, which terminates the agent process mid-run and produces exactly the "stopped hearing from agent" error above.
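To confirm whether the nodes really are overcommitted, compare what is requested against what is allocatable and check for recent evictions. A minimal check, assuming you have kubectl access to the cluster (kubectl top needs metrics-server, which AKS includes by default):

# Requested vs. allocatable per node ("Allocated resources" section)
kubectl describe nodes | grep -A 8 "Allocated resources"

# Live usage (metrics-server is installed by default on AKS)
kubectl top nodes

# Recent evictions that would kill agent pods mid-run
kubectl get events -A --field-selector reason=Evicted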
The KEDA documentation you shared does suggest that a ScaledJob is the right fit for long-running jobs, because it gives each pipeline job a fully isolated environment and scales agents automatically based on the number of pending jobs in the Azure Pipelines queue.
Based on that, I would recommend updating your ScaledJob YAML accordingly once KEDA is deployed on your AKS cluster.
Modify the Dockerfile so the agent can run long-lived jobs and pick up the environment variables the ScaledJob passes in (resource allocation itself is handled in the ScaledJob spec below).
FROM mcr.microsoft.com/dotnet/runtime:6.0
RUN apt-get update && \
    apt-get install -y curl jq
WORKDIR /azp
# Pin the agent version to download.
ENV AZP_AGENT_PACKAGE=https://vstsagentpackage.azureedge.net/agent/2.206.1/vsts-agent-linux-x64-2.206.1.tar.gz
RUN curl -LsS ${AZP_AGENT_PACKAGE} | tar -xz
# run.sh only works after the agent has been configured, so start from a
# script that runs config.sh with AZP_URL/AZP_TOKEN/AZP_POOL first (sketch below).
COPY ./start.sh .
RUN chmod +x ./start.sh
CMD [ "/bin/bash", "-c", "/azp/start.sh" ]
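The image assumes a /azp/start.sh, which is a script I am introducing here (not something from your repo): it configures the agent against your organization and pool using the AZP_URL, AZP_TOKEN and AZP_POOL variables that the ScaledJob injects, then hands over to run.sh. A minimal sketch along the lines of the usual self-hosted-agent-in-a-container pattern:

#!/bin/bash
set -e

# AZP_URL, AZP_TOKEN and AZP_POOL are injected by the ScaledJob below.
if [ -z "$AZP_URL" ] || [ -z "$AZP_TOKEN" ] || [ -z "$AZP_POOL" ]; then
  echo "AZP_URL, AZP_TOKEN and AZP_POOL must be set" >&2
  exit 1
fi

cd /azp

# Register this container as an agent; the pod hostname is used as the
# agent name so every job pod shows up as its own agent in the pool.
./config.sh --unattended \
  --url "$AZP_URL" \
  --auth pat \
  --token "$AZP_TOKEN" \
  --pool "$AZP_POOL" \
  --agent "$(hostname)" \
  --work _work \
  --replace \
  --acceptTeeEula

# --once makes the agent exit after a single job, which matches the
# one-pod-per-pipeline-job model of a ScaledJob.
exec ./run.sh --once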
Adjust the ScaledJob settings to optimize for long-running jobs and minimize premature scaling.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: azure-devops-scaledjob
  namespace: default
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: azdevops-agent
          image: arkoacr.azurecr.io/azure-devops-agent:latest
          env:
          - name: AZP_URL
            value: "https://dev.azure.com/arko"
          - name: AZP_POOL
            value: "arkodemo"
          - name: AZP_TOKEN
            valueFrom:
              secretKeyRef:
                name: pipeline-auth
                key: personalAccessToken
          resources:
            requests:
              cpu: "500m"     # adjust these request values based on actual needs
              memory: "1Gi"   # same logic for memory
            limits:
              cpu: "1000m"    # set a reasonable upper limit, such as 1 vCPU
              memory: "2Gi"   # set an upper memory limit to prevent high resource usage
        restartPolicy: Never
  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  maxReplicaCount: 5
  scalingStrategy:
    strategy: "default"
  triggers:
  - type: azure-pipelines
    metadata:
      poolID: "12"
      organizationURLFromEnv: "AZP_URL"
      personalAccessTokenFromEnv: "AZP_TOKEN"
This updated ScaledJob configuration should help prevent overcommitting resources and ensures that each agent pod gets a predictable amount of CPU and memory.
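Once it is applied, you can verify that KEDA creates one Job per queued pipeline run and that the pods survive until the run completes. A quick check, assuming the manifest is saved as scaledjob.yaml and KEDA was installed into the keda namespace (both names are just examples):

kubectl apply -f scaledjob.yaml

# The ScaledJob plus the Jobs/pods KEDA spawns for queued pipeline runs
kubectl get scaledjob azure-devops-scaledjob -n default
kubectl get jobs,pods -n default -w

# KEDA operator logs, if no agents ever show up in the Azure DevOps pool
kubectl logs -n keda deploy/keda-operator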
I already described two other approaches in the comments. One more option is to set a ResourceQuota on your cluster at the namespace level, something like this:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: resource-quota
  namespace: default
spec:
  hard:
    requests.cpu: "10000m"
    requests.memory: "20Gi"
    limits.cpu: "15000m"
    limits.memory: "30Gi"
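After applying the quota you can watch how much of it the agent jobs consume, for example:

# Shows used vs. hard limits for the quota in the namespace
kubectl describe resourcequota resource-quota -n default

Keep in mind that once a CPU/memory quota is active in a namespace, pods created there must declare requests and limits (as the ScaledJob above already does) or they will be rejected.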
Also, I found this on SO, which may help.