google-cloud-platform google-kubernetes-engine google-managed-prometheus

Google Managed Prometheus not collecting metrics from GKE cronjobs

I'm using Google's managed collection on my GKE cluster (v1.24.26) and I can't find a way to collect metrics related to Kubernetes cronjobs. Specifically, kube_cronjob_next_schedule_time, kube_job_status_failed and kube_job_status_succeeded are all missing.

Do I need to configure something specific to gather these metrics on GKE?

I tried restarting kube-state-metrics-0 and restarting the collectors, but neither helped.

Solution

Ok, this threw me too.

I realized (belatedly) that the kube-state-metrics install creates both a PodMonitoring and a ClusterPodMonitoring.

The PodMonitoring resource scrapes the metrics that the Pod created by statefulset.apps/kube-state-metrics publishes on its metrics-self port (8081). The ClusterPodMonitoring scrapes the metrics published on the Pod's metrics port (8080), but its relabeling rule filters out the cronjob-related metrics:

kubectl get clusterpodmonitoring/kube-state-metrics \
--output=jsonpath="{.spec.endpoints[0].metricRelabeling[0]}" \
| jq -r .

{
  "action": "keep",
  "regex": "kube_(daemonset|deployment|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler|job_created)(_.+)?",
  "sourceLabels": [
      "__name__"
  ]
}

NOTE The regex has no kube_cronjob alternative, and of the kube_job_* metrics it keeps only those matching kube_job_created.
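The effect of this keep rule can be approximated locally without touching the cluster. The following is only a sketch: it assumes grep -E -x is a close enough stand-in for Prometheus's fully anchored regex matching, and the list of sample names (including kube_pod_info) is chosen purely for illustration.

```shell
# Original keep-regex, copied from the ClusterPodMonitoring output above.
REGEX='kube_(daemonset|deployment|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler|job_created)(_.+)?'

# Illustrative sample: the three metrics from the question plus two
# names that the rule is expected to keep.
NAMES='kube_cronjob_next_schedule_time
kube_job_status_failed
kube_job_status_succeeded
kube_job_created
kube_pod_info'

# -x requires the whole line to match, approximating Prometheus's
# fully anchored relabeling regexes; -v inverts the match.
KEPT="$(printf '%s\n' "${NAMES}" | grep -E -x "${REGEX}")"
DROPPED="$(printf '%s\n' "${NAMES}" | grep -E -x -v "${REGEX}")"

printf 'kept:\n%s\n\ndropped:\n%s\n' "${KEPT}" "${DROPPED}"
```

With this sample, only kube_job_created and kube_pod_info land in the kept list; the three metrics from the question are dropped before they reach Cloud Monitoring.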

You will need to extend the regex so that it also matches the kube_cronjob and kube_job metrics that you want.

One way (!) to do this after you've deployed Kube State Metrics is to kubectl patch the clusterpodmonitoring resource, as shown in the commands that follow.

Of course, a better approach is to edit the regex in the Google-provided YAML (kube-state-metrics.yaml#L324) before you install Kube State Metrics.

VALUE="kube_(cronjob|daemonset|deployment|job|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler)(_.+)?"

PATCH="
[
    {
        'op':'replace',
        'path': '/spec/endpoints/0/metricRelabeling/0/regex',
        'value':'${VALUE}'
    }
]"

kubectl patch clusterpodmonitoring/kube-state-metrics \
--type=json \
--patch="${PATCH}"

NOTE This (VALUE) includes 2 changes:

Adds all kube_cronjob_* metrics
Adds all kube_job_* metrics (the broader job alternative replaces job_created, which it makes redundant)
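Before patching, the widened pattern can be checked offline. This is a minimal sketch under the assumption that grep -E -x behaves like Prometheus's anchored matching for this pattern; is_kept is a hypothetical helper name, not part of kubectl or Prometheus.

```shell
# Widened keep-regex, same string as VALUE in the patch.
VALUE='kube_(cronjob|daemonset|deployment|job|replicaset|pod|namespace|node|statefulset|persistentvolume|horizontalpodautoscaler)(_.+)?'

# Hypothetical helper: succeeds (exit 0) if the metric name would pass
# the "keep" rule, fails otherwise.
is_kept() {
  printf '%s\n' "$1" | grep -E -x -q "${VALUE}"
}

# The three metrics from the question, plus kube_job_created to confirm
# that it is still covered by the broader "job" alternative.
for NAME in kube_cronjob_next_schedule_time kube_job_status_failed \
            kube_job_status_succeeded kube_job_created; do
  if is_kept "${NAME}"; then
    echo "kept:    ${NAME}"
  else
    echo "dropped: ${NAME}"
  fi
done
```

All four names should be reported as kept, while metrics outside the kube_ families listed in the regex are still filtered out.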

You can confirm that Cloud Monitoring is now ingesting the metrics with Metrics Explorer, using either PromQL or native MQL (prometheus.googleapis.com/kube_cronjob_next_schedule_time/gauge), or with APIs Explorer for Cloud Monitoring's Prometheus API. The same API can also be called directly with curl:

PROJECT="..." # Your Project ID
ENDPOINT="https://monitoring.googleapis.com/v1/projects/${PROJECT}/location/global/prometheus/api/v1/query"

TOKEN="$(gcloud auth print-access-token)"

METRIC="kube_cronjob_next_schedule_time"

curl \
--silent \
--request POST \
--header "Authorization: Bearer ${TOKEN}" \
--header "Accept: application/json" \
--header "Content-Type: application/json" \
--data "{\"query\":\"${METRIC}\"}" \
"${ENDPOINT}" \
| jq -r .

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "kube_cronjob_next_schedule_time",
          "cluster": "...",
          "cronjob": "hello",
          "instance": "kube-state-metrics-0:metrics",
          "job": "kube-state-metrics",
          "location": "...",
          "namespace": "test",
          "project_id": "..."
        },
        "value": [
          1703893639.8,
          "1703893680"
        ]
      }
    ]
  }
}

NOTE In this case I'd created a CronJob called hello in the test namespace.
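The second element of "value" is the metric's value as a string, which for kube_cronjob_next_schedule_time is a Unix timestamp in seconds. As a small sketch, assuming GNU date (the -d "@..." form; BSD/macOS date uses -r instead), it can be turned into a readable UTC time:

```shell
# Value string copied from the sample response above
# (seconds since the Unix epoch).
NEXT="1703893680"

# GNU date syntax; on BSD/macOS the rough equivalent is:
#   date -u -r "${NEXT}" +%Y-%m-%dT%H:%M:%SZ
WHEN="$(date -u -d "@${NEXT}" +%Y-%m-%dT%H:%M:%SZ)"

echo "${WHEN}"
```

For the sample above this prints 2023-12-29T23:48:00Z, roughly 40 seconds after the query's evaluation time (1703893639.8), the first element of "value".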