google-kubernetes-engine etcd

GKE: how is etcd compaction handled when the API server is down, and what happens when etcd is full?


In our current clusters we have an emergency etcd compaction script that prevents etcd from locking up. We are looking into moving to GKE and wondering if it comes with something similar out of the box or what exactly happens when etcd gets full.


Solution

  • In general terms, GKE is a managed service. The control plane in particular is fully managed by Google's Site Reliability Engineers (SREs).

    This means the GKE control plane and its operations are Google's responsibility, and you do not take part in them. The point is to free up your time to focus on your application while Google's SREs monitor your cluster and its compute, networking, and storage resources.

    To answer your question about whether GKE comes with something similar (an etcd compaction script) out of the box, and what exactly happens when etcd gets full:

    I am not sure whether GKE has a solution like yours (an etcd compaction script). If it exists, it is managed by Google's SREs, and (depending on the cluster type you choose) you will not notice when they back up or perform maintenance on etcd or any other control plane component.

    In my experience, the most common issues related to a full etcd have to do with Jobs not being deleted. When a Job completes, no more Pods are created, but the existing Pods are not deleted either. Keeping them around lets you still view the logs of completed Pods to check for errors, warnings, or other diagnostic output. The Job object also remains after it completes so that you can view its status. It is up to you to delete old Jobs after noting their status. When enough of these objects accumulate, the etcd database can be overwhelmed by the amount of data and become unresponsive (this depends entirely on the number of Jobs running on your cluster).

    If etcd stops working or gets full, Google is in charge of fixing it. As mentioned above, whether you notice downtime on the control plane depends on the cluster type you choose. GKE offers zonal clusters (a single replica of the control plane running in a single zone), multi-zonal clusters (a single replica of the control plane running in a single zone, with nodes running in multiple zones), and regional clusters (multiple replicas of the control plane running in multiple zones within a given region). If you choose a regional GKE cluster, you get HA for your GKE control plane (3 replicas of each control plane resource).
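
As a rough illustration of the Job cleanup mentioned above, here is a minimal Python sketch that picks out finished Jobs older than a chosen age. It assumes the input is the parsed JSON from `kubectl get jobs --all-namespaces -o json`. The helper names `finished_at` and `jobs_to_delete` and the 7-day threshold are made up for this example, not anything GKE provides. If the built-in `spec.ttlSecondsAfterFinished` field on Jobs fits your case, it is probably simpler than a script like this.

```python
from datetime import datetime, timedelta, timezone


def finished_at(job):
    """Return when a Job finished (Complete or Failed), or None if it has not."""
    for cond in job.get("status", {}).get("conditions", []):
        if cond.get("type") in ("Complete", "Failed") and cond.get("status") == "True":
            ts = cond.get("lastTransitionTime")
            if ts:
                # Kubernetes timestamps look like 2024-01-01T00:00:00Z (UTC).
                parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
                return parsed.replace(tzinfo=timezone.utc)
    return None


def jobs_to_delete(job_list, max_age, now=None):
    """List (namespace, name) for finished Jobs older than max_age.

    job_list is assumed to be the dict form of a Kubernetes Job list,
    i.e. an object with an "items" array.
    """
    now = now or datetime.now(timezone.utc)
    stale = []
    for job in job_list.get("items", []):
        done = finished_at(job)
        if done is not None and now - done > max_age:
            meta = job["metadata"]
            stale.append((meta.get("namespace", "default"), meta["name"]))
    return stale


if __name__ == "__main__":
    # Hypothetical sample data standing in for real kubectl output.
    sample = {
        "items": [
            {
                "metadata": {"namespace": "batch", "name": "old-report"},
                "status": {"conditions": [
                    {"type": "Complete", "status": "True",
                     "lastTransitionTime": "2024-01-01T00:00:00Z"},
                ]},
            },
            {
                "metadata": {"namespace": "batch", "name": "still-running"},
                "status": {},
            },
        ]
    }
    fixed_now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    for namespace, name in jobs_to_delete(sample, timedelta(days=7), now=fixed_now):
        # Print the command instead of running it, so the output can be reviewed first.
        print(f"kubectl delete job {name} -n {namespace}")
```

The sketch only prints `kubectl delete job` commands rather than executing them, so you can review the list before removing anything. Running Jobs have no `Complete` or `Failed` condition and are always skipped.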