kubernetes prometheus grafana

Kubernetes pod/container status and memory/CPU statistics


I am using Prometheus to scrape metrics from our Kubernetes cluster and Grafana to visualize the information. I am trying to set up a view in Grafana that will display the following information:

  • The current status of pods and the containers within them. There appear to be two separate states, one for the pod and one for the container. I'm not sure whether I should show both or just one.
  • It might also be useful to display creation date and/or age (stretch goal).
  • The working set memory of each container (for those that are "running"). I'm trying to get an idea of the amount of memory "in-use". For a pod/container that's in an "incompatible" state, display blank (or N/A).
  • For containers that have specified memory requests and limits, the amounts specified. For containers without either of these, display blank (or N/A).

Note: The queries may be filtered by node (multi-select or "all").

[Optionally, I would also like to add CPU information (usage, requests/limits) to this table, but if it makes more sense or is easier to put it in a separate table view, that is totally fine.]

Here is a hypothetical table header layout:

node | namespace | pod | container | pod state | container state | memory | memory req | memory limit

I have been experimenting with metrics from a couple of different metric "families". I'm discovering that certain information (e.g. memory statistics) is available from one family, while other information (e.g. pod/container status) is available from another. It's difficult to merge the information from these two families.

Here are some of the observations I've made during this work:

  • kube_pod_container... metrics have information about pods/containers and their status, but no information about memory (or CPU) usage.
  • container_memory_... metrics have information about memory statistics, but there's no corresponding info about status. I'm assuming there's a similar set of metrics named container_cpu... that provide CPU statistics.
  • The information returned from container_memory_working_set_bytes includes data with distinct variations in the label dimensions. Not sure what this means. It seems odd that there would not be consistent labels across the result set. I have not looked at other similar metrics to see if this phenomenon occurs there as well.

I'm struggling with coming up with a single table view that accurately shows even a subset of what I described above. I get close, but there's always something wrong with the result.

I could really use some help, ideas and suggestions on how to accomplish this. Perhaps I'm trying to do too much and need to break it apart. I haven't found information online that indicates even which metrics are appropriate to use.


Solution

  • Here is a solution I came up with that satisfies most of the requirements.

    It uses four queries:

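    # Container status: keep the status series whose value is 1 and expose the state as a "status" label.
    label_replace(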
      label_join(
        label_replace(
            {__name__=~"(kube_pod_container_status_running|kube_pod_container_status_terminated|kube_pod_container_status_waiting)"}
          ==
            1,
          "status",
          "$1",
          "__name__",
          "kube_pod_container_status_(.*)"
        ),
        "pod_container",
        "__",
        "pod",
        "container"
      ),
      "__name__",
      "kube_pod_container_status",
      "__name__",
      ".*"
    )
    
    # Working-set memory per container.
    label_join(
      node_namespace_pod_container:container_memory_working_set_bytes,
      "pod_container",
      "__",
      "pod",
      "container"
    )
    
    # Memory requests per container.
    label_join(
      cluster:namespace:pod_memory:active:kube_pod_container_resource_requests,
      "pod_container",
      "__",
      "pod",
      "container"
    )
    
    # Memory limits per container.
    label_join(
      cluster:namespace:pod_memory:active:kube_pod_container_resource_limits,
      "pod_container",
      "__",
      "pod",
      "container"
    )
    

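    The first query combines three separate metrics (kube_pod_container_status_running, kube_pod_container_status_terminated and kube_pod_container_status_waiting) into a pseudo-metric named kube_pod_container_status. It keeps only the series whose value is 1, and the derived status label carries the actual status (running, terminated or waiting) for each result entry.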
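
As a rough illustration of what the label functions in the first query do to a single series, here is a minimal Python sketch. The two helpers only approximate PromQL's label_replace and label_join (they ignore edge cases such as empty replacement values), and the sample series (namespace default, pod web-0, container nginx) is hypothetical.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Approximate PromQL label_replace on one series' label dict."""
    # Prometheus anchors the regex, so it must match the whole source value.
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: the series is returned unchanged
    out = dict(labels)
    # Translate PromQL's "$1" capture syntax into Python's "\g<1>".
    out[dst] = match.expand(re.sub(r"\$(\d+)", r"\\g<\1>", replacement))
    return out


def label_join(labels, dst, separator, *srcs):
    """Approximate PromQL label_join on one series' label dict."""
    out = dict(labels)
    out[dst] = separator.join(labels.get(s, "") for s in srcs)
    return out


def derive_status(series):
    """Apply the same three steps as the first query, innermost first."""
    s = label_replace(series, "status", "$1",
                      "__name__", "kube_pod_container_status_(.*)")
    s = label_join(s, "pod_container", "__", "pod", "container")
    return label_replace(s, "__name__", "kube_pod_container_status",
                         "__name__", ".*")


# Hypothetical series that would have passed the "== 1" filter.
sample = {
    "__name__": "kube_pod_container_status_waiting",
    "namespace": "default",
    "pod": "web-0",
    "container": "nginx",
}
result = derive_status(sample)
print(result)
```

The state ends up in the status label and the metric name is normalised, so all three source metrics can share one set of table columns.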

    All four queries concatenate the pod and container labels into a new label, pod_container, using "__" as the separator. The "Join by field" transformation then joins the four results on this label. Finally, I use the "Organize fields" transformation to hide columns I don't want (many of which are duplicates), rename columns and reorder them.
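
To make the joining step concrete, the following Python sketch merges per-query rows on the shared pod_container key. It assumes an outer-style join, so a container with no memory sample still appears in the table; the answer does not say which join mode was used. The helper join_by_field and all sample values are hypothetical and only approximate what Grafana's transformation does.

```python
def join_by_field(key, *tables):
    """Outer-join lists of row dicts on `key` (an approximation of the
    "Join by field" transformation, not Grafana's actual implementation)."""
    merged = {}
    for table in tables:
        for row in table:
            # Rows sharing the same key value are folded into one wide row.
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


# Hypothetical results of three of the queries, one row per series.
status = [
    {"pod_container": "web-0__nginx", "status": "running"},
    {"pod_container": "job-1__worker", "status": "terminated"},
]
memory = [{"pod_container": "web-0__nginx", "memory": 52428800}]
limits = [{"pod_container": "web-0__nginx", "memory_limit": 134217728}]

rows = join_by_field("pod_container", status, memory, limits)
for row in rows:
    # Missing columns are shown as N/A, as the question asks for.
    print(row["pod_container"], row["status"],
          row.get("memory", "N/A"), row.get("memory_limit", "N/A"))
```

Because the key combines pod and container, two containers with the same name in different pods stay on separate rows. Pods with the same name in different namespaces would still collide, so adding namespace to the joined label may be worth considering.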