
GCP Alerting Policy for failed GKE CronJob


What would be the best way to set up a GCP monitoring alert policy for a failing Kubernetes CronJob? I haven't been able to find any good examples out there.

Right now, I have an OK solution based on monitoring logs in the Pod with ERROR severity. I've found this to be quite flaky, however. Sometimes a job will fail for some ephemeral reason outside my control (e.g., an external server returning a temporary 500) and on the next retry, the job runs successfully.

What I really need is an alert that is only triggered when a CronJob is in a persistent failed state. That is, Kubernetes has tried rerunning the whole thing, multiple times, and it's still failing. Ideally, it could also handle situations where the Pod wasn't able to come up either (e.g., downloading the image failed).

Any ideas here?

Thanks.


Solution

  • First of all, confirm the GKE version that you are running. The following commands will help you identify the default version and the available versions for a release channel (RAPID in these examples):

    Default version.

    gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
        --format="yaml(channels.channel,channels.defaultVersion)"
    

    Available versions.

    gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
        --format="yaml(channels.channel,channels.validVersions)"
    

    Now that you know your GKE version: since what you want is an alert that triggers only when a CronJob is in a persistent failed state, GKE Workload Metrics used to be GCP's solution for this. It provided a fully managed, highly configurable way to send all Prometheus-compatible metrics emitted by GKE workloads (such as a CronJob or a Deployment) to Cloud Monitoring. However, it was deprecated in GKE 1.24 and replaced with Google Cloud Managed Service for Prometheus, which is now the best option you've got inside GCP: it lets you monitor and alert on your workloads using Prometheus, without having to manually manage and operate Prometheus at scale.
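    Whichever Prometheus flavor you run, the alert itself is typically built on kube-state-metrics' `kube_job_status_failed` metric with a `for:` duration, so it only fires once retries have been failing persistently. Below is a minimal sketch of such a rule written to a local file; the namespace, rule names, and duration are placeholders, and Managed Service for Prometheus wraps rules in its own `Rules` custom resource rather than a plain file:

```shell
# Sketch: a Prometheus-style alerting rule on kube-state-metrics'
# kube_job_status_failed metric. Names/durations are placeholders.
cat > cronjob-failed-rule.yaml <<'EOF'
groups:
- name: cronjob-alerts
  rules:
  - alert: CronJobPersistentlyFailing
    # Fires only after the job has had failed pods continuously for
    # 15 minutes, i.e. after Kubernetes retries keep failing.
    expr: kube_job_status_failed{namespace="your-namespace"} > 0
    for: 15m
    annotations:
      summary: "Job {{ $labels.job_name }} keeps failing"
EOF
grep -c 'alert:' cronjob-failed-rule.yaml   # sanity check: one rule defined
```

    Because the metric covers Pod-level failures too (e.g. image pull problems keep the job's pods from succeeding), this also handles the "Pod never came up" case.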

    Plus, you have two options from outside of GCP: self-managed Prometheus, and the Prometheus Pushgateway.
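    With the Pushgateway approach, the idea is that the CronJob itself pushes a success/failure metric at the end of each run, and you alert when the failure metric stays non-zero. A rough sketch follows; the metric name and Pushgateway URL are placeholders, and the actual push is commented out since it needs a reachable Pushgateway instance:

```shell
# Metric in Prometheus' text exposition format (name is a placeholder).
PAYLOAD='cronjob_last_run_failed 1'

# Push it to a Pushgateway, grouped under the job name "my-cronjob".
# Uncomment with a real Pushgateway endpoint:
# echo "$PAYLOAD" | curl --data-binary @- http://pushgateway.example:9091/metrics/job/my-cronjob

# Local sanity check that the payload matches the exposition format.
echo "$PAYLOAD" | grep -Eq '^[a-z_]+ [01]$' && echo "payload ok"
```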

    Finally, and just FYI, it can be done manually by querying for the job, checking its start time, and comparing that to the current time. This way, with bash:

    START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq -r '.status.startTime')
    echo "$START_TIME"
    

    Alternatively, you can get the job’s current status as a JSON blob, as follows:

    kubectl -n=your-namespace get job your-job-name -o json | jq '.status'
    
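    To turn that JSON blob into a pass/fail check, look for a condition of type "Failed" with status "True" — that condition only appears once the job's retries (its backoff limit) are exhausted. The sample status below is hypothetical so the snippet runs without a cluster; with a real cluster, feed in the output of the `kubectl ... -o json | jq '.status'` command above instead:

```shell
# Hypothetical .status of a job that exhausted its retries:
STATUS='{"failed":4,"conditions":[{"type":"Failed","status":"True","reason":"BackoffLimitExceeded"}]}'

# A job is persistently failed when a condition of type "Failed" is "True".
FAILED=$(echo "$STATUS" | jq -r '.conditions[]? | select(.type=="Failed") | .status')
if [ "$FAILED" = "True" ]; then
  echo "job is in a persistent failed state"
fi
```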

    You can see the following thread for more reference too.

    Taking the “Failed” state as the crux of your requirement, you can set up a bash script with kubectl that sends an email whenever it sees a job in the “Failed” state. Here are some examples:

    while true; do
      if kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' | grep -q True; then
        echo "Job myjob failed" | mail -s jobfailed email@address
        break
      fi
      sleep 1
    done
    

    For newer Kubernetes versions, kubectl wait can block until the condition appears instead of polling:

    while true; do
      if kubectl wait --for=condition=failed job/myjob; then
        echo "Job myjob failed" | mail -s jobfailed email@address
        break
      fi
    done
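    The same condition check scales to every job in a namespace. Here it runs against a hypothetical `kubectl get jobs -o json` payload so the snippet is self-contained; on a real cluster, replace the sample with `kubectl -n your-namespace get jobs -o json`:

```shell
# Hypothetical output of: kubectl -n your-namespace get jobs -o json
JOBS='{"items":[
  {"metadata":{"name":"nightly-backup"},"status":{"conditions":[{"type":"Complete","status":"True"}]}},
  {"metadata":{"name":"report-mailer"},"status":{"conditions":[{"type":"Failed","status":"True"}]}}]}'

# Print the name of every job with a Failed=True condition.
echo "$JOBS" | jq -r '.items[]
  | select(.status.conditions[]? | (.type=="Failed" and .status=="True"))
  | .metadata.name'
```

    Piping that list into the mail loop above gives you one notification per persistently failed job rather than per check.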