Confusion regarding Kubernetes pod status metrics

This is really two questions in one; I think they are related.

What does the `kube_pod_status_phase` metric value represent?

When I view the kube_pod_status_phase metric in Prometheus, the metric value is always 0 or 1, but it's not clear to me what 0 and 1 mean. Let's use an example. The query below returns the value of this metric where the "phase" label equals "Running".

Query:

kube_pod_status_phase{phase="Running"}

Result: (sample)

kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="argocd", phase="Running", pod="argocd-server-6f8487c84d-5qqv7", service="prometheus-kube-state-metrics", uid="ee84e48d-0302-4f5a-9e81-f4f0d7d0223f"}
1
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="default", phase="Running", pod="rapid7-monitor-799d9f9898-fst5q", service="prometheus-kube-state-metrics", uid="1561cd66-b5c4-48b9-83d0-11f4f1f0d5d9"}
1
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Running", pod="clean-deploy-cronjob-28112310-ljws6", service="prometheus-kube-state-metrics", uid="5510f859-74ca-471f-9c50-c1b8976119f3"}
0
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Running", pod="clean-deploy-cronjob-28113750-75m8v", service="prometheus-kube-state-metrics", uid="d63e5038-a8bb-4f88-bd77-82c66d183e1b"}
0

Why do some "running" pods have a value of 0, while others have a value of 1? Are the items with a value of 1 "currently" running (at the time the query was run) and the items with a value of 0 "had been" running, but are no longer?

There seem to be inconsistencies with what the `kube_pod_status_phase` metric produces between Prometheus and Grafana. Why?

If I use a slightly different version of the query above, I get different results between Prometheus and what is shown in Grafana.

Query:

kube_pod_status_phase{phase=~"Pending"} != 0

Result: (Prometheus)

empty query result

Result: (Grafana table view)

pod                                 namespace   phase
clean-deploy-cronjob-28115190-2rhv5 deploy      Pending

If I go back to Prometheus and focus on that pod specifically:

Query:

kube_pod_status_phase{pod="clean-deploy-cronjob-28115190-2rhv5"}

Result:

kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Failed", pod="clean-deploy-cronjob-28115190-2rhv5", service="prometheus-kube-state-metrics", uid="4dd948f6-327b-4c00-abc9-57d16bd588d0"}
0
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Pending", pod="clean-deploy-cronjob-28115190-2rhv5", service="prometheus-kube-state-metrics", uid="4dd948f6-327b-4c00-abc9-57d16bd588d0"}
0
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Running", pod="clean-deploy-cronjob-28115190-2rhv5", service="prometheus-kube-state-metrics", uid="4dd948f6-327b-4c00-abc9-57d16bd588d0"}
0
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Succeeded", pod="clean-deploy-cronjob-28115190-2rhv5", service="prometheus-kube-state-metrics", uid="4dd948f6-327b-4c00-abc9-57d16bd588d0"}
1
kube_pod_status_phase{container="kube-state-metrics", endpoint="http", instance="10.244.211.138:8080", job="kube-state-metrics", namespace="deploy", phase="Unknown", pod="clean-deploy-cronjob-28115190-2rhv5", service="prometheus-kube-state-metrics", uid="4dd948f6-327b-4c00-abc9-57d16bd588d0"}
0

Notice that the entry with phase "Running" has a value of 0, while the entry with a value of 1 has the phase "Succeeded". You could argue that the status changed during the period when I ran these queries. No, it has not. It has been showing these results for a long time.

This is just one example of strange inconsistencies I've seen between a query run in Prometheus vs. Grafana.

UPDATE:

I think I have gained some insight into the inconsistencies question. When I run the query in Prometheus it gives me the results as of "now" (a guess on my part). In Grafana, it takes into account the "time window" that's available in the dashboard header. When I dialed it back to "the last 5 minutes", the pending entry disappeared.

I see that there is an option at the dashboard level in Grafana to hide the time picker, which if set to hide, hides not only the time picker, but also the refresh period selector. If this option is used, I'm curious as to how often the dashboard is actually refreshed. Should I use this to effectively make Grafana only care about "now", instead of some time window into the past?

Solution

What does the kube_pod_status_phase metric value represent?

kube_pod_status_phase exposes a set of time series for every pod, one for each value of the phase label: "Failed", "Pending", "Running", "Succeeded", and "Unknown".

For every pod, only one of those series has the value 1, which means the pod is in the corresponding phase. The others are 0.
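
To make the 0/1 encoding concrete, here is a minimal Python sketch. It assumes the samples have already been fetched and reduced to (labels, value) pairs; the `samples` list and the `current_phases` helper are hypothetical names, and the data simply mirrors the per-pod result shown in the question.

```python
# Hypothetical sample data mirroring the query result shown in the question:
# one (labels, value) pair per phase for a single pod.
samples = [
    ({"pod": "clean-deploy-cronjob-28115190-2rhv5", "phase": "Failed"}, 0),
    ({"pod": "clean-deploy-cronjob-28115190-2rhv5", "phase": "Pending"}, 0),
    ({"pod": "clean-deploy-cronjob-28115190-2rhv5", "phase": "Running"}, 0),
    ({"pod": "clean-deploy-cronjob-28115190-2rhv5", "phase": "Succeeded"}, 1),
    ({"pod": "clean-deploy-cronjob-28115190-2rhv5", "phase": "Unknown"}, 0),
]


def current_phases(samples):
    """Return {pod: phase} using only the series whose value is 1."""
    return {labels["pod"]: labels["phase"] for labels, value in samples if value == 1}


print(current_phases(samples))
```

In PromQL itself the same filter is `kube_pod_status_phase == 1`, which keeps only the series for the phase each pod is currently in.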

Why do some "running" pods have a value of 0, while others have a value of 1?

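The selector phase="Running" returns the "Running" series for every pod, whatever phase the pod is actually in, so a value of 0 only means that the pod is not in the Running phase. Check those pods for other phases; the cron job pods have most likely already finished. Also remember that Prometheus is not a real-time solution: it only has values at the resolution of scrape_interval, so it is quite possible that a pod's state had not been updated yet. On top of that, short-lived pods can show all kinds of strange behavior in metrics.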
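
As a rough illustration of the scrape_interval point, the sketch below samples a made-up pod timeline at fixed intervals. The timings, the 15 second interval, and the `phase_at` and `observed_phases` helpers are all assumptions for the example, and the model ignores kube-state-metrics sitting between the cluster and Prometheus.

```python
def phase_at(t):
    """Hypothetical pod timeline in seconds: Pending, then Running, then Succeeded."""
    if t < 3:
        return "Pending"
    if t < 8:
        return "Running"
    return "Succeeded"


def observed_phases(scrape_interval, duration):
    """Phases seen when sampling the timeline once per scrape_interval."""
    return [phase_at(t) for t in range(0, duration, scrape_interval)]


# With a 15 s interval the 5 s Running window falls between two scrapes,
# so the Running series for this pod never records a 1.
print(observed_phases(15, 60))
```

A short-lived job pod can therefore appear to jump straight from Pending to Succeeded in the stored data.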

There seem to be inconsistencies with what the kube_pod_status_phase metric produces between Prometheus and Grafana. Why?

Most likely your query in Grafana has the type "Range" or "Both", and in table mode it shows all values over the time range selected for the dashboard.

If you only want to see the latest values (as of the "To" value of the dashboard time range), go to the query options (under the query in panel edit mode) and set the type to "Instant".
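
The difference between the two query types can be sketched with a simplified model in Python. The `pending_series` history and both helper functions are hypothetical, and the model leaves out details such as step alignment and staleness handling; it only shows why a `!= 0` filter can match over a range but not at a single instant.

```python
# Hypothetical history of the phase="Pending" series for one pod:
# (timestamp in seconds, value). The pod was Pending only at the start.
pending_series = [(0, 1), (60, 1), (120, 0), (180, 0), (240, 0)]


def range_query(series, start, end):
    """All samples in [start, end] that pass the `!= 0` filter."""
    return [(t, v) for t, v in series if start <= t <= end and v != 0]


def instant_query(series, at):
    """Only the most recent sample at or before `at`, if it passes `!= 0`."""
    past = [(t, v) for t, v in series if t <= at]
    if not past:
        return []
    latest = max(past, key=lambda s: s[0])
    return [latest] if latest[1] != 0 else []


print(range_query(pending_series, 0, 240))   # non-empty: the pod shows up in the table
print(instant_query(pending_series, 240))    # empty: same as the Prometheus result
```

Narrowing the window, for example `range_query(pending_series, 180, 240)`, also returns nothing, which matches what happened when the dashboard was set to the last 5 minutes.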

I see that there is an option at the dashboard level in Grafana to hide the time picker, which if set to hide, hides not only the time picker, but also the refresh period selector. If this option is used, I'm curious as to how often the dashboard is actually refreshed. Should I use this to effectively make Grafana only care about "now", instead of some time window into the past?

No. That option is meant for other uses, for example presentation mode.
