azure-container-service

My AKS Cluster was brought down, how can I recover?


I have been playing around with load-testing my application on a single agent cluster in AKS. During the testing, the connection to the dashboard stalled and never resumed. My application seems down as well, so I am assuming the cluster is in a bad state.

The API server is restate-f4cbd3d9.hcp.centralus.azmk8s.io

kubectl cluster-info dump shows the following error:

{
    "name": "kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
    "namespace": "kube-system",
    "selfLink": "/api/v1/namespaces/kube-system/events/kube-dns-v20-6c8f7f988b-9wpx9.14fbbbd6bf60f0cf",
    "uid": "47f57d3c-d577-11e7-88d4-0a58ac1f0249",
    "resourceVersion": "185572",
    "creationTimestamp": "2017-11-30T02:36:34Z",
    "InvolvedObject": {
        "Kind": "Pod",
        "Namespace": "kube-system",
        "Name": "kube-dns-v20-6c8f7f988b-9wpx9",
        "UID": "9d2b20f2-d3f5-11e7-88d4-0a58ac1f0249",
        "APIVersion": "v1",
        "ResourceVersion": "299",
        "FieldPath": "spec.containers{kubedns}"
    },
    "Reason": "Unhealthy",
    "Message": "Liveness probe failed: Get http://10.244.0.4:8080/healthz-kubedns: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)",
    "Source": {
        "Component": "kubelet",
        "Host": "aks-agentpool-34912234-0"
    },
    "FirstTimestamp": "2017-11-30T02:23:50Z",
    "LastTimestamp": "2017-11-30T02:59:00Z",
    "Count": 6,
    "Type": "Warning"
}

There are also some pod sync errors in the kube-system namespace.

Example of issue:

az aks browse -g REstate.Server -n REstate

Merged "REstate" as current context in C:\Users\User\AppData\Local\Temp\tmp29d0conq

Proxy running on http://127.0.0.1:8001/
Press CTRL+C to close the tunnel...
error: error upgrading connection: error dialing backend: dial tcp 10.240.0.4:10250: getsockopt: connection timed out
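The dial timeout above is on port 10250, which is the kubelet API port, so the proxy failure usually points at the kubelet (or the node itself) being down rather than the control plane. Before SSHing in, it can be worth checking what the API server thinks of the node (a sketch; the node name is taken from the event output above, and these commands need a working `kubectl` context):

    # Is the agent node still reporting Ready?
    kubectl get nodes -o wide

    # Conditions such as MemoryPressure/DiskPressure here would explain
    # kubelet trouble after a load test
    kubectl describe node aks-agentpool-34912234-0

    # Recent warnings in kube-system (liveness probe failures, sync errors, etc.)
    kubectl get events -n kube-system --field-selector type=Warning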

Solution

  • You'll probably need to SSH into the node to check whether the kubelet service is running. Going forward, you can set resource quotas to keep workloads from exhausting all resources on the cluster nodes.

    Resource Quotas - https://kubernetes.io/docs/concepts/policy/resource-quotas/
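A sketch of what that node check might look like, assuming you have (or add) SSH access to the agent VM; `azureuser` and the key path are placeholders, not values from the question:

    # If no SSH key is set up yet, one hedged option is to push one onto the VM:
    # az vm user update -g <node-resource-group> -n aks-agentpool-34912234-0 \
    #   --username azureuser --ssh-key-value ~/.ssh/id_rsa.pub

    ssh azureuser@<node-ip>

    # On the node: is kubelet running, and what has it been logging?
    sudo systemctl status kubelet
    sudo journalctl -u kubelet --since "1 hour ago"

    # If it has wedged (e.g. after resource exhaustion), restarting may recover it
    sudo systemctl restart kubelet

For the prevention side, a ResourceQuota is a namespaced object that caps aggregate resource requests and limits, so one runaway load test can't starve system pods. A minimal illustrative manifest (the name and numbers are made up; size them for your workload):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: load-test-quota    # hypothetical name
      namespace: default
    spec:
      hard:
        requests.cpu: "1"      # total CPU the namespace may request
        requests.memory: 1Gi
        limits.cpu: "2"        # total CPU limit across all pods
        limits.memory: 2Gi

Apply it with `kubectl apply -f quota.yaml`, and note that once a quota constrains a resource, pods in that namespace must declare requests/limits for it (a LimitRange can supply defaults).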