No Pods reachable or schedulable on kubernetes cluster

I have two Kubernetes clusters in IBM Cloud: one has 2 nodes, the other has 4.

The one with 4 nodes is working properly, but on the other one I had to temporarily remove the worker nodes for monetary reasons (it shouldn't be paid for while idle).

When I reactivated the two nodes, everything seemed to start up fine, and as long as I don't try to interact with Pods it still looks fine on the surface: no messages about unavailability or critical health status. I did delete two obsolete Namespaces, which got stuck in the Terminating state, but I could resolve that issue by restarting a cluster node (I no longer remember exactly which one).

When everything looked OK, I tried to access the Kubernetes dashboard (everything before that was done at the IBM management level or on the command line), but surprisingly I found it unreachable, with an error page in the browser stating:

503: Service Unavailable

There was a small JSON message at the bottom of that page, which said:

{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": { },
  "status": "Failure",
  "message": "error trying to reach service: read tcp\u003e172.19.151.38:8090: read: connection reset by peer",
  "reason": "ServiceUnavailable",
  "code": 503
}

The Pod was shown as running, so I ran kubectl logs kubernetes-dashboard-54674bdd65-nf6w7 --namespace=kube-system, but instead of logs I got this message:

Error from server: Get "":
read tcp>
read: connection reset by peer

Then I found out I'm neither able to get the logs of any Pod running in that cluster, nor am I able to deploy any new custom kubernetes object that requires scheduling (I actually could apply Services or ConfigMaps, but no Pod, ReplicaSet, Deployment or similar).

I already tried to

  • reload the worker nodes in the workerpool
  • restart the worker nodes in the workerpool
  • restart the kubernetes-dashboard Deployment

Unfortunately, none of the above actions changed the accessibility of the Pods.

There's another thing that might be related (though I'm not quite sure it actually is):

In the other cluster, which runs fine, there are three calico Pods and all three are up. In the problematic cluster only two of the three calico Pods are up and running; the third one stays in the Pending state, and a kubectl describe pod calico-blablabla-blabla reveals the reason in an Event:

Warning  FailedScheduling  13s   default-scheduler
0/2 nodes are available: 2 node(s) didn't have free ports for the requested pod ports.

Does anyone have a clue about what's going on in that cluster and can point me to possible solutions? I don't really want to delete the cluster and spawn a new one, but I cannot use the user interfaces (dashboard or CLI).


The result of kubectl describe pod kubernetes-dashboard-54674bdd65-4m2ch --namespace=kube-system:

Name:                 kubernetes-dashboard-54674bdd65-4m2ch
Namespace:            kube-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Start Time:           Mon, 15 Nov 2021 09:01:30 +0100
Labels:               k8s-app=kubernetes-dashboard
Annotations: ca52cefaae58d8e5ce6d54883cb6a6135318c8db53d231dc645a5cf2e67d821e
Status:               Running
Controlled By:  ReplicaSet/kubernetes-dashboard-54674bdd65
    Container ID:  containerd://bac57850055cd6bb944c4d893a5d315c659fd7d4935fe49083d9ef8ae03e5c31
    Image ID:
    Port:          8443/TCP
    Host Port:     0/TCP
    State:          Running
      Started:      Mon, 15 Nov 2021 09:01:37 +0100
    Ready:          True
    Restart Count:  0
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get https://:8443/ delay=30s timeout=30s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/ delay=10s timeout=30s period=10s #success=1 #failure=3
    Environment:  <none>
      /certs from kubernetes-dashboard-certs (rw)
      /tmp from tmp-volume (rw)
      /var/run/secrets/ from kube-api-access-sc9kw (ro)
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubernetes-dashboard-certs
    Optional:    false
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    SizeLimit:  <unset>
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
                    op=Exists for 600s
                    op=Exists for 600s
Events:                      <none>


  • Problem resolved…

    The cause of the problem was an update of the cluster to Kubernetes version 1.21 while my cluster met the following conditions:

    • private and public service endpoint enabled
    • VRF disabled

    Root cause:

    In Kubernetes version 1.21, Konnectivity replaces OpenVPN as the network proxy that secures the communication between the Kubernetes API server on the master and the worker nodes in the cluster.
    With Konnectivity, communication between the master and the cluster nodes breaks when both of the above-mentioned conditions are met.

    Solution steps:

    • disabled the private service endpoint (the public one does not seem to be a problem) by using the command
      ibmcloud ks cluster master private-service-endpoint disable --cluster <CLUSTER_NAME>
      (this command is provider-specific; if you are experiencing the same problem with a different provider or on a local installation, find out how to disable the private service endpoint there)
    • refreshed the cluster master using ibmcloud ks cluster master refresh --cluster <CLUSTER_NAME>
    • finally, reloaded all the worker nodes (in the web console; this should be possible through a command as well)
    • waited for about 30 minutes, after which:
      • Dashboard available / reachable again
      • Pods accessible and schedulable again
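
    The steps above can be sketched as a small shell script. This is only a sketch under assumptions, not a tested procedure: it assumes the ibmcloud CLI with the Kubernetes Service plugin is installed and logged in, and that ibmcloud ks worker reload is available for your worker type (I used the web console for that step). The function names run and recover_cluster and the DRY_RUN variable are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the recovery sequence for an IBM Cloud Kubernetes Service cluster.
# Assumptions: ibmcloud CLI + "ks" plugin installed and logged in;
# worker IDs are passed explicitly by the caller.
set -euo pipefail

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: recover_cluster <CLUSTER_NAME> [WORKER_ID...]
recover_cluster() {
  local cluster="$1"
  shift
  # 1. Disable the private service endpoint (command from the steps above).
  run ibmcloud ks cluster master private-service-endpoint disable --cluster "$cluster"
  # 2. Refresh the master so that the change is picked up.
  run ibmcloud ks cluster master refresh --cluster "$cluster"
  # 3. Reload every worker node given as a remaining argument.
  local worker
  for worker in "$@"; do
    run ibmcloud ks worker reload --cluster "$cluster" --worker "$worker"
  done
}
```

    Calling DRY_RUN=1 recover_cluster <CLUSTER_NAME> <WORKER_ID> only prints the commands, which is a safe way to review them first. The worker IDs can be looked up with ibmcloud ks worker ls --cluster <CLUSTER_NAME>. Without DRY_RUN the CLI may ask for confirmation, and the 30 minutes of waiting still apply afterwards.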

    General recommendation:

    BEFORE you update any cluster to Kubernetes 1.21, check whether the private service endpoint is enabled. If it is, either disable it, delay the update until you can disable it, or enable VRF (virtual routing and forwarding). I could not enable VRF myself, but I was told it is likely to resolve the issue.
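
    A rough way to script that check, as a sketch only: the function below scans the text output of ibmcloud ks cluster get for a private endpoint URL. It assumes the output contains a line starting with "Private Service Endpoint URL:" followed by an https URL when the endpoint is enabled; verify that against the output of your own CLI version. The function name has_private_endpoint is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ibmcloud ks cluster get` text output on stdin.
# Succeeds (exit 0) if a private service endpoint URL is listed, fails otherwise.
# Assumption: the line looks like "Private Service Endpoint URL:   https://...".
has_private_endpoint() {
  grep -E '^Private Service Endpoint URL:[[:space:]]*https?://' >/dev/null
}
```

    It could be used like this: ibmcloud ks cluster get --cluster <CLUSTER_NAME> | has_private_endpoint && echo "private service endpoint is enabled, check before updating to 1.21". The check says nothing about VRF, which has to be verified separately for the account.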