My distributed server application emits a gauge for each active client. I want to see which clients are active against 2 nodes simultaneously. So I have a promQL query like so:
count by (clientid) (count by (nodeid, clientid) (active{foo="bar", ...})) > 1
The inner count
filters by the ...
labels (environment etc) and reports a 1 for every pairwise result. The outer count
aggregates by clientid
and then reports those clients hitting more than 1 instance. This works really great, and with grafana I can visualize those clientid's easily. The problem is that I lose the nodeid with the aggregation.
To recover nodeid
I am wrapping that query (which is kind of a WHERE clause) like so:
active{foo="bar", ...} + on(clientid) group_left() (...)
This recovers the original time series for just those clients reported in the first query. Now I can extract the nodeid
's that are being visited. That's useful to get a list of nodeid
s.
But this doesn't mean the clients are active simultaneously against both server nodes. To capture that I have another metric, the timestamp of the connection time. So instead I use my WHERE clause as a kind of join:
timestamp{...} + on(clientid) group_left() (...)
The result (timestamps) have a constant factor added, the count, which is fine. I end up with 2 time series per clientid, one for each nodeid. If I plot this as a timeseries I can visually see how tight or distant the two are and draw a conclusion.
I cannot instrument the clients. I have to infer dual connection from the server/nodeid data.
If you skipped the above, the summary is I have a metric with 2 time series for each clientid
, that differ on a single label, nodeid
. The value of these time series is a unix timestamp. I am trying to identify those time series pairs where the timestamps coincide.
Rather than look at lines and see where they coincide, I would like to generate a third TS from the two for each clientid, the timestamp value difference between them. Then my visualization would be more horizontal than a line up and to the right, in most cases centered on zero but in some rarer cases I'd see that a client just switched servers. I'm not having luck figuring this out. It could either be done in promQL or as a grafana transform, I'm not picky.
The resultant TS or data needs to carry the clientid
label so I can use that as the legend.
I'm actually using VictoriaMetrics so an MQL solution is great also.
"difference" is great when there are exactly 2 nodeids, let's assume that and solve that first, but what if a client is hitting three nodes?
Prometheus/PromQL subtract two gauge metrics because the label values that differentiate the two time series are constants in that example. In my time series I can have arbitrary label pairs.
I used the time() function to subtract the time of sample. this doesn't center on 0 because the time that the metric is recorded isn't the same as the sample time, however it does produce a graph where perfectly dual simultaneous connections appear as 2 coincident horizontal lines at whatever offset from connection time to sample time. offsets within a pair are now also easily visible as the y scale of the graph is now limited to the sample period, not to the range of the query.