amazon-web-services kubernetes amazon-eks keda

EKS Pods being terminated for no reason


I wonder if someone can help me.

Kubernetes (1.21, EKS platform version eks.4) is terminating running pods without any error or reason. The only thing I can see in the events is:

7m47s       Normal    Killing                   pod/test-job-6c9fn-qbzkb                          Stopping container test-job

Because I've set up an anti-affinity rule, only one pod can run on each node, so every time a pod gets killed, the autoscaler brings up another node.

These are the cluster-autoscaler logs:

I0208 19:10:42.336476       1 cluster.go:148] Fast evaluation: ip-10-4-127-38.us-west-2.compute.internal for removal
I0208 19:10:42.336484       1 cluster.go:169] Fast evaluation: node ip-10-4-127-38.us-west-2.compute.internal cannot be removed: pod annotated as not safe to evict present: test-job-6c9fn-qbzkb
I0208 19:10:42.336493       1 scale_down.go:612] 1 nodes found to be unremovable in simulation, will re-check them at 2022-02-08 19:15:42.335305238 +0000 UTC m=+20363.008486077

I0208 19:15:04.360683       1 klogx.go:86] Pod default/test-job-6c9fn-8wx2q is unschedulable
I0208 19:15:04.360719       1 scale_up.go:376] Upcoming 0 nodes
I0208 19:15:04.360861       1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-8xlarge-84bf6ad9-ca4a-4293-a3e8-95bef28db16d, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.360901       1 scale_up.go:449] No pod can fit to eks-ec2-8xlarge-84bf6ad9-ca4a-4293-a3e8-95bef28db16d
I0208 19:15:04.361035       1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-inf1-90bf6ad9-caf7-74e8-c930-b80f785bc743, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.361062       1 scale_up.go:449] No pod can fit to eks-ec2-inf1-90bf6ad9-caf7-74e8-c930-b80f785bc743
I0208 19:15:04.361162       1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-large-62bf6ad9-ccd4-6e03-5c78-c3366d387d50, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.361194       1 scale_up.go:449] No pod can fit to eks-ec2-large-62bf6ad9-ccd4-6e03-5c78-c3366d387d50
I0208 19:15:04.361512       1 scale_up.go:412] Skipping node group eks-eks-on-demand-10bf6ad9-c978-9b35-c7fc-cdb9977b27cb - max size reached
I0208 19:15:04.361675       1 scale_up.go:300] Pod test-job-6c9fn-8wx2q can't be scheduled on eks-ec2-test-58bf6d43-13e8-9acc-5173-b8c5054a56da, predicate checking error: node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector; debugInfo=
I0208 19:15:04.361711       1 scale_up.go:449] No pod can fit to eks-ec2-test-58bf6d43-13e8-9acc-5173-b8c5054a56da
I0208 19:15:04.361723       1 waste.go:57] Expanding Node Group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f would waste 75.00% CPU, 86.92% Memory, 80.96% Blended
I0208 19:15:04.361747       1 scale_up.go:468] Best option to resize: eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f
I0208 19:15:04.361762       1 scale_up.go:472] Estimated 1 nodes needed in eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f
I0208 19:15:04.361780       1 scale_up.go:586] Final scale-up plan: [{eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f 0->1 (max: 2)}]
I0208 19:15:04.361801       1 scale_up.go:675] Scale-up: setting group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.361826       1 auto_scaling_groups.go:219] Setting asg eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.362154       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"81b80048-920c-4bf1-b2c0-ad5d067d74f4", APIVersion:"v1", ResourceVersion:"359476", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.374021       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"81b80048-920c-4bf1-b2c0-ad5d067d74f4", APIVersion:"v1", ResourceVersion:"359476", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: setting group eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f size to 1
I0208 19:15:04.541658       1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop
I0208 19:15:04.541859       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"test-job-6c9fn-8wx2q", UID:"67beba1d-4f52-4860-91af-89e5852e4cad", APIVersion:"v1", ResourceVersion:"359507", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eks-ec2-xlarge-84bf6ad9-cb6d-7e24-7eb5-a00c369fd82f 0->1 (max: 2)}]
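When the autoscaler logs get long, it can help to split each line into its parts and filter by source file or message. The following is only a minimal sketch, assuming the usual klog header layout (severity letter, MMDD date, time, thread id, `file:line]`); the name `parse_klog_line` is made up for illustration and the sample line is copied from the logs above.

```python
import re

# klog header: <severity><MMDD> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
KLOG_RE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\]\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_klog_line(text):
    """Return the header fields and message of one klog line, or None if it does not match."""
    match = KLOG_RE.match(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])  # source line number as an integer
    return fields


# Sample line taken from the cluster-autoscaler output above.
sample = (
    "I0208 19:10:42.336484       1 cluster.go:169] Fast evaluation: node "
    "ip-10-4-127-38.us-west-2.compute.internal cannot be removed: "
    "pod annotated as not safe to evict present: test-job-6c9fn-qbzkb"
)
entry = parse_klog_line(sample)
print(entry["time"], entry["file"], entry["message"])
```

Filtering the parsed entries for `cluster.go` or `scale_down.go` would show only this autoscaler's scale-down decisions, which makes it easier to see whether it ever decided to remove the node itself.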

I'm running an EKS cluster with cluster-autoscaler and KEDA's aws-sqs-queue trigger. I've set up an autoscaling node group with Spot instances.

For testing purposes I've defined a ScaledJob consisting of a container with a simple Python script that loops through time.sleep. The pod should run for 30 minutes, but it never gets that far. In general it ends after 15 minutes.

        {
            "apiVersion": "keda.sh/v1alpha1",
            "kind": "ScaledJob",
            "metadata": {
                "name": id,
                "labels": {"myjobidentifier": id},
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                },
            },
            "spec": {
                "jobTargetRef": {
                    "parallelism": 1,
                    "completions": 1,
                    "backoffLimit": 0,
                    "template": {
                        "metadata": {
                            "labels": {"job-type": id},
                            "annotations": {
                                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                            },
                        },
                        "spec": {
                            "affinity": {
                                "nodeAffinity": {
                                    "requiredDuringSchedulingIgnoredDuringExecution": {
                                        "nodeSelectorTerms": [
                                            {
                                                "matchExpressions": [
                                                    {
                                                        "key": "eks.amazonaws.com/nodegroup",
                                                        "operator": "In",
                                                        "values": group_size,
                                                    }
                                                ]
                                            }
                                        ]
                                    }
                                },
                                "podAntiAffinity": {
                                    "requiredDuringSchedulingIgnoredDuringExecution": [
                                        {
                                            "labelSelector": {
                                                "matchExpressions": [
                                                    {
                                                        "key": "job-type",
                                                        "operator": "In",
                                                        "values": [id],
                                                    }
                                                ]
                                            },
                                            "topologyKey": "kubernetes.io/hostname",
                                        }
                                    ]
                                },
                            },
                            "serviceAccountName": service_account.service_account_name,
                            "containers": [
                                {
                                    "name": id,
                                    "image": image.image_uri,
                                    "imagePullPolicy": "IfNotPresent",
                                    "env": envs,
                                    "resources": {
                                        "requests": requests,
                                    },
                                    "volumeMounts": [
                                        {
                                            "mountPath": "/tmp",
                                            "name": "tmp-volume",
                                        }
                                    ],
                                }
                            ],
                            "volumes": [
                                {"name": "tmp-volume", "emptyDir": {}}
                            ],
                            "restartPolicy": "Never",
                        },
                    },
                },
                "pollingInterval": 30,
                "successfulJobsHistoryLimit": 10,
                "failedJobsHistoryLimit": 100,
                "maxReplicaCount": 30,
                "rolloutStrategy": "default",
                "scalingStrategy": {"strategy": "default"},
                "triggers": [
                    {
                        "type": "aws-sqs-queue",
                        "metadata": {
                            "queueURL": queue.queue_url,
                            "queueLength": "1",
                            "awsRegion": region,
                            "identityOwner": "operator",
                        },
                    }
                ],
            },
        }
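For reference, the test container's script could look roughly like the sketch below. This is not the original script, only an assumed equivalent: it sleeps in short steps for 30 minutes and additionally logs when it receives SIGTERM, which is the signal the kubelet sends when it stops a container. The names `run`, `handle_sigterm`, and `main` are hypothetical.

```python
import signal
import sys
import time


def handle_sigterm(signum, frame):
    # Log the moment the container is asked to stop, then exit with 128 + 15.
    print(f"received signal {signum} at {time.strftime('%H:%M:%S')}", flush=True)
    sys.exit(143)


def run(total_seconds=30 * 60, interval=10, sleep=time.sleep):
    """Sleep in short steps until total_seconds have elapsed; return the number of steps."""
    ticks = 0
    elapsed = 0
    while elapsed < total_seconds:
        step = min(interval, total_seconds - elapsed)
        sleep(step)
        elapsed += step
        ticks += 1
        print(f"alive for {elapsed}s", flush=True)
    return ticks


def main():
    # Register the handler before starting the long sleep loop.
    signal.signal(signal.SIGTERM, handle_sigterm)
    run()
```

The container entrypoint would call `main()`. With the heartbeat lines and the SIGTERM message in the pod logs, the exact time of the kill can be compared with the node and Auto Scaling group events.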

I know this is not a resource problem (dummy code and large instances), nor an eviction problem (the logs make it clear that the pod is protected from eviction), but I really don't know how else to troubleshoot this.

thanks a lot!!

EDIT:

Same behavior with On-Demand and Spot instances.

EDIT 2:

I added the AWS Node Termination Handler, and now I'm seeing other events:

ip-10-4-126-234.us-west-2.compute.internal.16d223107de38c5f
NodeNotSchedulable
Node ip-10-4-126-234.us-west-2.compute.internal status is now: NodeNotSchedulable

test-job-p85f2-txflr.16d2230ea91217a9
FailedScheduling
0/2 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) were unschedulable.

If I check the scaling group activity:

Instance i-03d27a1cf341405e1 was taken out of service in response to a user request, shrinking the capacity from 1 to 0.

Solution

  • Well, this was an annoying, small, and tricky thing.

    There was another EKS Cluster in the account, but in that cluster, cluster-autoscaler was started like this:

    command:
                - ./cluster-autoscaler
                - --v=4
                - --stderrthreshold=info
                - --cloud-provider=aws
                - --skip-nodes-with-local-storage=false
                - --expander=least-waste
                - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
    

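    This cluster-autoscaler was discovering every Auto Scaling group in the account that carried that tag, including the node groups of the other cluster, and terminating their instances after the timeout of 15 minutes (most likely because, from its point of view, those instances never registered as nodes of its own cluster).

    So the lesson here is that each cluster-autoscaler must restrict auto-discovery to its own cluster's tag, like this: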

    command:
                - ./cluster-autoscaler
                - --v=4
                - --stderrthreshold=info
                - --cloud-provider=aws
                - --skip-nodes-with-local-storage=false
                - --expander=least-waste
                - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<clusterName>
    

    And all the Auto Scaling groups need to be tagged accordingly, where clusterName is the name of the cluster that the autoscaler runs in.
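To illustrate why the extra tag matters, here is a small sketch that models auto-discovery as "select every group that carries all of the listed tag keys". That model is a simplification and an assumption, not the autoscaler's actual code; the cluster names, group names, tag values, and the functions `discovered_groups` and `auto_discovery_flag` are all hypothetical.

```python
ENABLED_KEY = "k8s.io/cluster-autoscaler/enabled"


def cluster_key(cluster_name):
    """Tag key that ties an Auto Scaling group to one cluster."""
    return f"k8s.io/cluster-autoscaler/{cluster_name}"


def discovered_groups(asg_tags, required_keys):
    """Return the names of the groups whose tags contain every required key."""
    required = set(required_keys)
    return sorted(name for name, tags in asg_tags.items() if required <= set(tags))


def auto_discovery_flag(cluster_name):
    """Build the flag value; keys are comma-separated with no spaces."""
    keys = [ENABLED_KEY, cluster_key(cluster_name)]
    return "--node-group-auto-discovery=asg:tag=" + ",".join(keys)


# Hypothetical account with two clusters sharing the generic "enabled" tag.
asg_tags = {
    "asg-cluster-a": {ENABLED_KEY: "true", cluster_key("cluster-a"): "owned"},
    "asg-cluster-b": {ENABLED_KEY: "true", cluster_key("cluster-b"): "owned"},
}

# Autoscaler of cluster-a, started with only the generic tag: sees both groups.
loose = discovered_groups(asg_tags, [ENABLED_KEY])
# Same autoscaler with the cluster-specific tag added: sees only its own group.
strict = discovered_groups(asg_tags, [ENABLED_KEY, cluster_key("cluster-a")])

print(loose)
print(strict)
print(auto_discovery_flag("cluster-a"))
```

With only the generic key, the autoscaler in one cluster also picks up the other cluster's group, which matches the behavior described above; adding the cluster-specific key narrows discovery to the groups that belong to it.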