Google Managed Prometheus not collecting metrics from GKE cronjobs


I'm using Google's managed collection on my GKE cluster (v1.24.26) and I can't find a way to collect metrics for Kubernetes CronJobs. Specifically, kube_cronjob_next_schedule_time, kube_job_status_failed and kube_job_status_succeeded are all missing.

Do I need to configure something specific to gather these metrics on GKE?

I tried restarting kube-state-metrics-0 and restarting the collectors, but neither helped.


Solution

  • Ok, this threw me too.

    I realized (belatedly) that the Google-provided kube-state-metrics manifest creates both a PodMonitoring and a ClusterPodMonitoring.

    The PodMonitoring resource scrapes the metrics that the Pod created by statefulset.apps/kube-state-metrics publishes on its metrics-self port (8081). The ClusterPodMonitoring scrapes the metrics published on the Pod's metrics port (8080), but its relabeling rule keeps only an allowlist of metric names, and that list doesn't include the cronjob-related metrics:

    kubectl get clusterpodmonitoring/kube-state-metrics \
    --output=jsonpath="{.spec.endpoints[0].metricRelabeling[0]}" \
    | jq -r .
    
    {
      "action": "keep",
      "regex": "kube_(daemonset|deployment|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler|job_created)(_.+)?",
      "sourceLabels": [
          "__name__"
      ]
    }
    

    NOTE The regex has no kube_cronjob alternative, and of the kube_job_* metrics it keeps only those matching kube_job_created.
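
    To see why the metrics from the question disappear, the keep rule can be replayed locally. This is a minimal sketch, not part of the managed collector: is_kept is a hypothetical helper, and grep -x is used here to approximate Prometheus relabeling, which anchors the regex against the whole metric name.

```shell
#!/usr/bin/env bash
# Regex copied verbatim from the ClusterPodMonitoring output above.
DEFAULT_REGEX='kube_(daemonset|deployment|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler|job_created)(_.+)?'

# is_kept (hypothetical helper): prints "kept" when the metric name matches the
# regex in full, "dropped" otherwise. grep -x requires a whole-line match,
# which approximates the anchoring applied by a relabeling "keep" action.
is_kept() {
  local regex="$1" metric="$2"
  if printf '%s\n' "${metric}" | grep -Eqx -- "${regex}"; then
    echo "kept"
  else
    echo "dropped"
  fi
}

# The three metrics from the question, plus two that the default rule allows.
for m in kube_cronjob_next_schedule_time kube_job_status_failed \
         kube_job_status_succeeded kube_job_created kube_pod_info; do
  printf '%-34s %s\n' "${m}" "$(is_kept "${DEFAULT_REGEX}" "${m}")"
done
```

    With the default regex, the three metrics from the question are reported as dropped, while kube_job_created and kube_pod_info are kept.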

    To collect them, you will need to extend the regex so that it also matches the kube_cronjob and kube_job metrics that you want.

    The better approach is to edit the Google-provided YAML (kube-state-metrics.yaml#L324) before you Install Kube State Metrics.

    If Kube State Metrics is already deployed, one way (!) to make the change is to kubectl patch the clusterpodmonitoring resource:

    VALUE="kube_(cronjob|daemonset|deployment|job|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler)(_.+)?"
    
    PATCH="
    [
        {
            'op':'replace',
            'path': '/spec/endpoints/0/metricRelabeling/0/regex',
            'value':'${VALUE}'
        }
    ]"
    
    kubectl patch clusterpodmonitoring/kube-state-metrics \
    --type=json \
    --patch="${PATCH}"
    

    NOTE The new regex (VALUE) makes 2 changes:

    • Adds all kube_cronjob_* metrics
    • Adds all kube_job_* metrics by replacing job_created with job (kube_job_created_* is still matched, so the narrower alternative is redundant)
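
    Instead of typing the new regex by hand, it can be derived from the one currently installed, which keeps every alternative that was already there. This is only a sketch: widen_regex is a hypothetical helper, and it assumes the installed regex still has the shape shown earlier (starting with kube_( and ending its alternatives with |job_created)). The result lists the alternatives in a different order from VALUE above but matches the same names.

```shell
#!/usr/bin/env bash
# Pasted from the kubectl output shown earlier. On a live cluster this string
# could be read with the same kubectl get ... --output=jsonpath command.
CURRENT_REGEX='kube_(daemonset|deployment|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler|job_created)(_.+)?'

# widen_regex (hypothetical helper): broadens job_created to job and adds a
# cronjob alternative. In sed's basic regex syntax "(", ")" and "|" are literal.
widen_regex() {
  printf '%s\n' "$1" \
    | sed -e 's/|job_created)/|job)/' -e 's/^kube_(/kube_(cronjob|/'
}

VALUE="$(widen_regex "${CURRENT_REGEX}")"
echo "${VALUE}"

# Sanity check: the three metrics from the question should now match in full.
for m in kube_cronjob_next_schedule_time kube_job_status_failed kube_job_status_succeeded; do
  if printf '%s\n' "${m}" | grep -Eqx -- "${VALUE}"; then
    echo "kept:    ${m}"
  else
    echo "dropped: ${m}"
  fi
done
```

    If the installed regex has a different shape, the sed expressions will not match and VALUE comes back unchanged, so it is worth reading the echoed value before patching.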

    You can confirm that Cloud Monitoring now ingests the metrics in Metrics Explorer, using either PromQL or native MQL (prometheus.googleapis.com/kube_cronjob_next_schedule_time/gauge). Alternatively, query Cloud Monitoring's Prometheus API, either through APIs Explorer or directly with curl:

    PROJECT="..." # Your Project ID
    ENDPOINT="https://monitoring.googleapis.com/v1/projects/${PROJECT}/location/global/prometheus/api/v1/query"
    
    TOKEN="$(gcloud auth print-access-token)"
    
    METRIC="kube_cronjob_next_schedule_time"
    
    curl \
    --silent \
    --request POST \
    --header "Authorization: Bearer ${TOKEN}" \
    --header "Accept: application/json"   \
    --header "Content-Type: application/json"   \
    --data "{\"query\":\"${METRIC}\"}" \
    "${ENDPOINT}" \
    | jq -r .
    
    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          {
            "metric": {
              "__name__": "kube_cronjob_next_schedule_time",
              "cluster": "...",
              "cronjob": "hello",
              "instance": "kube-state-metrics-0:metrics",
              "job": "kube-state-metrics",
              "location": "...",
              "namespace": "test",
              "project_id": "..."
            },
            "value": [
              1703893639.8,
              "1703893680"
            ]
          }
        ]
      }
    }
    

    NOTE In this case I'd created a CronJob called hello in the test namespace.
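
    In the response above, the "value" array holds the sample timestamp followed by the metric value, and for kube_cronjob_next_schedule_time the value is itself a Unix timestamp. A small sketch for turning it into something readable follows; epoch_to_utc and seconds_until are hypothetical helpers, and the two numbers are copied from the response rather than fetched.

```shell
#!/usr/bin/env bash
# Values copied from the "value" array in the response above.
SAMPLE_TS="1703893639.8"   # when the sample was evaluated (epoch seconds)
NEXT_RUN="1703893680"      # metric value: next scheduled run (epoch seconds)

# epoch_to_utc (hypothetical helper): formats whole epoch seconds as UTC.
# Tries the GNU date syntax first and falls back to the BSD/macOS form.
epoch_to_utc() {
  date -u -d "@$1" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
    || date -u -r "$1" '+%Y-%m-%dT%H:%M:%SZ'
}

# seconds_until (hypothetical helper): shell arithmetic is integer-only,
# so awk handles the fractional sample timestamp.
seconds_until() {
  awk -v next_run="$1" -v now="$2" 'BEGIN { printf "%.1f\n", next_run - now }'
}

echo "next run: $(epoch_to_utc "${NEXT_RUN}")"
echo "seconds until next run: $(seconds_until "${NEXT_RUN}" "${SAMPLE_TS}")"
```

    For the sample shown, the next run is 2023-12-29T23:48:00Z, roughly 40 seconds after the query was evaluated.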