prometheus grafana promql

What is the correct way to use Prometheus vector matching with wildcard label matchers?


I am trying to match two metrics that have a one-to-many relationship. The first metric, webrtc_metrics_audio_outbound_rtp_bytes_sent, maps to many series of the second metric, relay_node_audio_track_bytes. That is, for each audio outbound RTP stream, many relay nodes consume the stream. Each stream has a session_id that I'm trying to match across the metrics while also retaining the pod_name that is specific to each relay node.

I'm using comparison operators with bool modifiers because my intent is to set up an alert based on these two metrics. The alert should fire whenever there is nonzero data for webrtc_metrics_audio_outbound_rtp_bytes_sent but zero data for relay_node_audio_track_bytes with the same session_id.

Here is my attempt using the following query and the corresponding output in Grafana:

((sum by (session_id, pod_name) (rate(relay_node_audio_track_bytes{pod_name=~"$pod",session_id=~"$session"}[$__rate_interval]))) == bool 0)
    * on (session_id) group_left(pod_name) 
((sum by (session_id) (label_replace(rate(webrtc_metrics_audio_outbound_rtp_bytes_sent{app="capturer",id=~"$session"}[$__rate_interval]), "session_id","$1","id","(.*)"))) > bool 0) > 0

You can see in the first graph that the query works as intended when a single Pod is selected in the dropdown. But when I use a wildcard to query all pods, I receive the error: execution: multiple matches for labels: grouping labels must ensure unique matches

[Screenshot: OK query with Pod label matcher pinned]

Here are the left and right sides of the query showing all the labels in each metric. Note I used label_replace in the vector matching query to rename id to session_id.

LHS: (rate(playback_relay_node_audio_track_bytes{pod_name=~"$pod",session_id=~"$session"}[$__rate_interval]))


RHS: label_replace(rate(webrtc_metrics_audio_outbound_rtp_bytes_sent{app="capturer",id=~"$session"}[$__rate_interval]), "session_id","$1","id","(.*)")


Can somebody please explain why selecting a specific Pod does not throw the same error as using a wildcard label matcher? Is there some other labeling method I need to use to get this working as intended? Ideally I'd like to see this boolean condition plotted across all pods and session ids. Thanks!


Solution

  • Try removing pod_name from the group_left() modifier:

    ((sum by (session_id, pod_name) (rate(relay_node_audio_track_bytes{pod_name=~"$pod",session_id=~"$session"}[$__rate_interval]))) == bool 0)
        * on (session_id) group_left() 
    ((sum by (session_id) (label_replace(rate(webrtc_metrics_audio_outbound_rtp_bytes_sent{app="capturer",id=~"$session"}[$__rate_interval]), "session_id","$1","id","(.*)"))) > bool 0) > 0
    

    Prometheus keeps all the labels from the left side in the result of the * operator (or any other binary operator) when the group_left() modifier is used. For example, the original pod_name values from the left side remain in the result of * with an empty group_left() modifier.

    The labels listed inside the group_left() modifier are taken from the matching time series on the right side of *. In this case, the time series returned by the right side of * have no pod_name label. So the original pod_name values from the left side are replaced with empty values from the right side, which effectively deletes the label. This results in the duplicate time series error when the same session_id value is present in multiple left-side time series with different pod_name values. When a single pod is selected, each session_id appears in only one left-side time series, so no duplicates arise and the query succeeds.

    See more details in the official docs.
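
The label handling described above can be modelled in a few lines of Python. This is only a rough sketch of the behaviour, not Prometheus code: result_labels and find_collisions are hypothetical helper names, the label values (s1, relay-a, relay-b) are made up for illustration, and the matching is simplified to on (session_id).

```python
def result_labels(many_side, one_side, include):
    """Simplified model of the output labels for
    `many * on (...) group_left(include) one`."""
    out = dict(many_side)  # group_left keeps the labels of the left ("many") side
    for name in include:
        value = one_side.get(name, "")
        if value:
            out[name] = value    # label listed in group_left(): copied from the right side
        else:
            out.pop(name, None)  # absent on the right side: the label is dropped
    return out


def find_collisions(left_series, right_series, include, on=("session_id",)):
    """Return the output label sets that more than one left-side series maps to."""
    right_by_key = {tuple(r.get(k, "") for k in on): r for r in right_series}
    seen, collisions = set(), []
    for left in left_series:
        right = right_by_key.get(tuple(left.get(k, "") for k in on))
        if right is None:
            continue  # no matching series on the right side
        key = tuple(sorted(result_labels(left, right, include).items()))
        if key in seen:
            collisions.append(dict(key))  # would trigger the "multiple matches" error
        seen.add(key)
    return collisions


# Hypothetical sample data: one session consumed by two relay pods.
left = [
    {"session_id": "s1", "pod_name": "relay-a"},
    {"session_id": "s1", "pod_name": "relay-b"},
]
right = [{"session_id": "s1"}]  # the capturer side has no pod_name label

wildcard_with_pod_name = find_collisions(left, right, include=["pod_name"])
wildcard_empty_group_left = find_collisions(left, right, include=[])
single_pod_with_pod_name = find_collisions(left[:1], right, include=["pod_name"])
```

In this model, wildcard_with_pod_name contains one colliding label set ({"session_id": "s1"}), while the other two results are empty. That mirrors the question: group_left(pod_name) only collides once several pods share a session_id, and an empty group_left() avoids the collision because pod_name stays on each result series.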