
Is there any way to DRY repetitive PromQL query fragments like info-metrics group_left label matchers (joins)?


I have a number of queries for alerts and dashboard use that have an incredible amount of copy/paste boilerplate for filtering and enriching with labels.

Is there no way to save and re-use this repetitive PromQL server-side, like views in SQL databases? Server-side functions, macros, ... anything?

(I know I can "save" the query as a recording rule, but that's incredibly inefficient when joining on labels: the recording rule has to expose every label anyone might ever want, producing painfully high cardinality and expensive info-metrics. It wastes storage and memory, and "solving" it with "just add another TB of RAM for Prometheus" may be fashionable in the cloud, but it's incredibly wasteful.)

Consider this query using kube-state-metrics data:

sum without(job,instance,service,endpoint,metrics_path,prometheus) (
  kubelet_volume_stats_available_bytes{kube_cluster="$kube_cluster"}
  # enrich with labels
  * on (namespace,persistentvolumeclaim)
    group_left(some_org_specific_label,other_org_specific_label)
    group by (namespace,persistentvolumeclaim,some_org_specific_label,other_org_specific_label) (
      kube_persistentvolumeclaim_labels{kube_cluster="$kube_cluster"}
    )
  * on (namespace,persistentvolumeclaim)
    group_left(persistentvolume)
    group by (namespace,persistentvolumeclaim,persistentvolume) (
      # for some reason kube_persistentvolumeclaim_info uses the label "volumename"
      # while kube_persistentvolume_info uses "persistentvolume"
      label_replace(
        kube_persistentvolumeclaim_info{kube_cluster="$kube_cluster"},
        "persistentvolume", "$1", "volumename", "^(.*)$"
      )
    )
  * on (persistentvolume)
    group_left(csi_driver,csi_volume_handle,storageclass)
    group by (persistentvolume,csi_driver,csi_volume_handle,storageclass) (
      kube_persistentvolume_info{kube_cluster="$kube_cluster"}
    )
)

This just says:

  • query kubelet_volume_stats_available_bytes filtered to kube_cluster="$kube_cluster"
  • join on kube_persistentvolumeclaim_labels to add labels some_org_specific_label and other_org_specific_label
  • join on kube_persistentvolumeclaim_info to find the persistentvolume name associated with the persistentvolumeclaim (handling label name inconsistency)
  • join on kube_persistentvolume_info to find the persistent volume's CSI ID and details
  • discard unwanted labels from the result

It's ugly, but it's not that bad... until you also want to write another query for disk space percentage free, and another for I/O thresholds, and so on, each of which repeats all the same boilerplate.
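For instance, here's what a "percent free" variant ends up looking like; everything after the division is the same join pyramid again (sketched here with the last two joins elided):

sum without(job,instance,service,endpoint,metrics_path,prometheus) (
  100 *
  kubelet_volume_stats_available_bytes{kube_cluster="$kube_cluster"}
  / kubelet_volume_stats_capacity_bytes{kube_cluster="$kube_cluster"}
  # ...then the same joins as above, repeated verbatim:
  * on (namespace,persistentvolumeclaim)
    group_left(some_org_specific_label,other_org_specific_label)
    group by (namespace,persistentvolumeclaim,some_org_specific_label,other_org_specific_label) (
      kube_persistentvolumeclaim_labels{kube_cluster="$kube_cluster"}
    )
  # ...plus the kube_persistentvolumeclaim_info and kube_persistentvolume_info joins...
)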

I also seem to need to push down filter criteria manually to get an efficient query; Prometheus doesn't appear to do anything like a SQL engine's filter push-down.
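(To illustrate with made-up names metric_a, info_metric, and foo: a matcher can't be applied to the result of a join and distributed into its selectors, so every selector has to repeat it.)

# Not valid PromQL - a matcher can't be attached to the result of a join:
#   (metric_a * on (namespace) group_left(foo) info_metric){kube_cluster="$kube_cluster"}
# Instead the filter has to be repeated by hand in every selector:
metric_a{kube_cluster="$kube_cluster"}
  * on (namespace) group_left(foo)
    info_metric{kube_cluster="$kube_cluster"}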

And that first query is the short example. Here's another, getting a workload metric enriched with some kube pod labels, a kube pod annotation, and the running container image:

# This aggregation drops unwanted labels, since PromQL lacks a proper label_drop(...) operator to drop non-cardinal labels
sum without(endpoint,instance,job,prometheus,container,uid) (
    some_workload_specific_metric{kube_cluster="$kube_cluster"}
    # join on kube_pod_labels for the org-specific labels
    * on (uid)
    group_left(org_specific_label_1, org_specific_label_2)
    # note the group by (...) expression repeats the labels from both the on (...) join key and
    # the group_left(...) label list. This protects the join from churn when unrelated labels
    # are added to the info metric. It's probably safe to write
    # group ignoring(container,instance,job)
    # in this case, but better to make the query robust:
    group by (uid, org_specific_label_1, org_specific_label_2) (
        kube_pod_labels{kube_cluster="$kube_cluster"}
    )
    # join on kube_pod_info for the node hosting the pod and the pod ip address
    * on (uid)
    group_left(pod_ip,node)
    group by (uid, pod_ip, node) (
        kube_pod_info{kube_cluster="$kube_cluster"}
    )
    # join on kube_pod_container_info for the container image. Note that we join on container_id too
    * on (uid,container_id)
    group_left(image_spec,image_id)
    group by (uid,container_id,image_spec,image_id) (
        kube_pod_container_info{kube_cluster="$kube_cluster"}
    )
    # join on kube_pod_annotations for org_specific_annotation_1, if any
    * on (uid)
    group_left(org_specific_annotation_1)
    group by (uid,org_specific_annotation_1) (
        kube_pod_annotations{kube_cluster="$kube_cluster"}
    )
)

Imagine reusing that in every (say) alerting rule you want to fire with those labels exposed in it...
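Each such rule has to embed the whole expression. Something like this (a sketch with a made-up alert name and threshold; rules can't use the $kube_cluster dashboard variable, so the filter is hard-coded here):

groups:
  - name: workload-alerts
    rules:
      - alert: WorkloadMetricHigh          # hypothetical name and threshold
        expr: |
          sum without(endpoint,instance,job,prometheus,container,uid) (
            some_workload_specific_metric{kube_cluster="prod"}
            * on (uid)
              group_left(org_specific_label_1, org_specific_label_2)
              group by (uid, org_specific_label_1, org_specific_label_2) (
                kube_pod_labels{kube_cluster="prod"}
              )
            # ...and the other three joins above, repeated verbatim in every rule...
          ) > 100
        for: 15m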

Is there a saner way to do this? In a SQL database I'd just CREATE VIEW and join on it. Does everyone just use recording rules and pay the huge blow-out price of grabbing and recording every label anyone might possibly want, then discarding most of them most of the time?
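That recording-rule "view" would look roughly like this (a sketch of just the first query; the rule has to record the fully-enriched series, with every label anyone might later want, for every PVC, all the time):

groups:
  - name: pvc-enriched
    rules:
      # hypothetical rule name; the enriched series and all their joined labels
      # now land in TSDB storage whether or not anything ever queries them
      - record: org:kubelet_volume_stats_available_bytes:enriched
        expr: |
          sum without(job,instance,service,endpoint,metrics_path,prometheus) (
            kubelet_volume_stats_available_bytes
            * on (namespace,persistentvolumeclaim)
              group_left(some_org_specific_label,other_org_specific_label)
              group by (namespace,persistentvolumeclaim,some_org_specific_label,other_org_specific_label) (
                kube_persistentvolumeclaim_labels
              )
            # ...plus the persistentvolume and CSI joins, baked in up front...
          )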


Solution

  • Prometheus doesn't support common table expression (CTE)-like functionality. If you need it, try MetricsQL from VictoriaMetrics, a PromQL-like query language that supports, among other things, WITH expressions. For example, if you want to re-use the same query for different metric names, you can put that query into a WITH template function that accepts a metric name, and then call the function with different metric names as needed:

    WITH (
      f(m) = some_complex_query_here
    )
    (
      f(foo),  # expand the query with `foo` metric
      f(bar),  # expand the query with `bar` metric
    )
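
    For example, the PVC-enrichment boilerplate from the first query in the question could be written once as a template and applied to any metric (a sketch; pvc_enrich is just an arbitrary template name):

    WITH (
      pvc_enrich(q) = sum without(job,instance,service,endpoint,metrics_path,prometheus) (
        q
        * on (namespace,persistentvolumeclaim)
          group_left(some_org_specific_label,other_org_specific_label)
          group by (namespace,persistentvolumeclaim,some_org_specific_label,other_org_specific_label) (
            kube_persistentvolumeclaim_labels{kube_cluster="$kube_cluster"}
          )
        # ...the remaining joins go here once, instead of in every query...
      )
    )
    pvc_enrich(kubelet_volume_stats_available_bytes{kube_cluster="$kube_cluster"})

    Note that WITH expressions are expanded by VictoriaMetrics at query time, so this works only when the query is evaluated by VictoriaMetrics rather than by Prometheus itself.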
    

    Disclaimer: I'm the author of MetricsQL.