I am using Prometheus to scrape metrics from our Kubernetes cluster and Grafana to visualize the information. I am trying to set up a view in Grafana that will display the following information:
Note: The queries may be filtered by node (multi-select or "all").
[Optionally, I would also like to include CPU information (usage, requests/limits) in this table, but if it makes more sense or is easier to split it into a separate table view, that is totally fine.]
Here is a hypothetical table header layout:
node | namespace | pod | container | pod state | container state | memory | memory req | memory limit
---|---|---|---|---|---|---|---|---
I have been experimenting with getting information from various metrics from a couple of different metrics "families". I'm discovering that certain information (e.g. memory statistics) is available from one family, while other information (e.g. pod/container status) is available from another family. It's difficult to "merge" the information from metrics from these two families together.
Here are some of the observations I've made during this work:
- kube_pod_container... metrics have information about pods/containers and their status, but no information about memory (or CPU) usage.
- container_memory_... metrics have information about memory statistics, but there's no corresponding info about status. I'm assuming there's a similar set of metrics named container_cpu... that provide CPU statistics.
- container_memory_working_set_bytes includes data with distinct variations in the label dimensions. Not sure what this means. It seems odd that there would not be consistent labels across the result set. I have not looked at other similar metrics to see if this phenomenon occurs there as well.

I'm struggling to come up with a single table view that accurately shows even a subset of what I described above. I get close, but there's always something wrong with the result.
I could really use some help, ideas and suggestions on how I can accomplish this. Perhaps I'm trying to do too much and need to break it apart. There does not appear to be information on the Internet that gives a good idea of even the appropriate metrics to try to use.
Here is a solution I came up with that satisfies (most of) the requirements.
4 queries:
label_replace(
  label_join(
    label_replace(
      {__name__=~"(kube_pod_container_status_running|kube_pod_container_status_terminated|kube_pod_container_status_waiting)"} == 1,
      "status",
      "$1",
      "__name__",
      "kube_pod_container_status_(.*)"
    ),
    "pod_container",
    "__",
    "pod",
    "container"
  ),
  "__name__",
  "kube_pod_container_status",
  "__name__",
  ".*"
)

label_join(
  node_namespace_pod_container:container_memory_working_set_bytes,
  "pod_container",
  "__",
  "pod",
  "container"
)

label_join(
  cluster:namespace:pod_memory:active:kube_pod_container_resource_requests,
  "pod_container",
  "__",
  "pod",
  "container"
)

label_join(
  cluster:namespace:pod_memory:active:kube_pod_container_resource_limits,
  "pod_container",
  "__",
  "pod",
  "container"
)
The first query combines information from 3 separate metrics into a pseudo-metric named kube_pod_container_status. The derived status label provides the actual status for a given result entry.
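To make the two label_replace steps in the first query easier to follow, here is a minimal Python sketch of what they do to a single series. It is a simplified model, not Prometheus code: it assumes a series can be represented as a plain dict of labels, and the pod and container names (api-0, app) are hypothetical.

```python
import re

def label_replace(labels, dst, replacement, src, regex):
    """Simplified model of PromQL's label_replace for one series (a dict of labels)."""
    # PromQL regexes are fully anchored, hence fullmatch.
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        # No match: the series is returned unchanged.
        return dict(labels)
    # Expand $1, $2, ... in the replacement with the captured groups.
    value = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), replacement)
    out = dict(labels)
    out[dst] = value
    return out

# Hypothetical series, as selected by the __name__=~"..." matcher.
series = {
    "__name__": "kube_pod_container_status_waiting",
    "pod": "api-0",
    "container": "app",
}

# Inner label_replace: copy the metric-name suffix into a new "status" label.
with_status = label_replace(
    series, "status", "$1", "__name__", "kube_pod_container_status_(.*)"
)

# Outer label_replace: overwrite the metric name with one shared pseudo-name.
renamed = label_replace(
    with_status, "__name__", "kube_pod_container_status", "__name__", ".*"
)

print(renamed)
```

After both steps, the three source metrics look like one metric whose status label carries running, terminated or waiting.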
All queries concatenate the pod and container labels into a new label, pod_container. This label is used to join the results of all four queries (i.e. via the "Join by field" transformation). I then use the "Organize fields" transformation to hide columns I don't want (many of which are duplicates), rename labels and reorder columns.
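The join itself can be sketched in Python as well. This is only a rough analogue of label_join followed by an outer join on one field. It is not how Grafana implements the transformation, and the sample values, column names (status_flag, memory, memory_req, memory_limit) and helper names are all made up for illustration.

```python
def label_join(labels, dst, separator, *src_labels):
    """Simplified model of PromQL's label_join for one series (a dict of labels)."""
    out = dict(labels)
    out[dst] = separator.join(labels.get(name, "") for name in src_labels)
    return out

def to_rows(samples, value_column):
    """Turn (labels, value) samples into table rows that carry a pod_container key."""
    rows = []
    for labels, value in samples:
        row = label_join(labels, "pod_container", "__", "pod", "container")
        row.pop("__name__", None)      # the metric name is not needed as a column
        row[value_column] = value
        rows.append(row)
    return rows

def join_by_field(field, *tables):
    """Outer-join several tables on one field, merging rows that share its value."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[field], {}).update(row)
    return list(merged.values())

# Hypothetical results of the four queries for a single container.
base = {"pod": "api-0", "container": "app"}
status = to_rows([({**base, "status": "running"}, 1)], "status_flag")
memory = to_rows([(base, 52428800)], "memory")
requests = to_rows([(base, 67108864)], "memory_req")
limits = to_rows([(base, 134217728)], "memory_limit")

table = join_by_field("pod_container", status, memory, requests, limits)
print(table)
```

A composite key is needed because neither label is unique on its own: a pod can run several containers, and the same container name can appear in many pods. The key does not include the namespace, so pods with the same name in different namespaces would presumably be merged into one row; adding namespace to the joined label would be one way to avoid that.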