Search code examples
prometheusgrafanapromqlvictoriametricsmetricsql

promQL or metricsQL subtract two time series


Preface

My distributed server application emits a gauge for each active client. I want to see which clients are active against 2 nodes simultaneously. So I have a promQL query like so:

count by (clientid) (count by (nodeid, clientid) (active{foo="bar", ...})) > 1

The inner count filters by the ... labels (environment etc) and reports a 1 for every pairwise result. The outer count aggregates by clientid and then reports those clients hitting more than 1 instance. This works really great, and with grafana I can visualize those clientid's easily. The problem is that I lose the nodeid with the aggregation.

To recover nodeid I am wrapping that query (which is kind of a WHERE clause) like so:

active{foo="bar", ...} + on(clientid) group_left() (...)

This recovers the original time series for just those clients reported in the first query. Now I can extract the nodeid's that are being visited. That's useful to get a list of nodeids.

But this doesn't mean the clients are active simultaneously against both server nodes. To capture that I have another metric, the timestamp of the connection time. So instead I use my WHERE clause as a kind of join:

timestamp{...} + on(clientid) group_left() (...)

The result (timestamps) have a constant factor added, the count, which is fine. I end up with 2 time series per clientid, one for each nodeid. If I plot this as a timeseries I can visually see how tight or distant the two are and draw a conclusion.

I cannot instrument the clients. I have to infer dual connection from the server/nodeid data.

Problem

If you skipped the above, the summary is I have a metric with 2 time series for each clientid, that differ on a single label, nodeid. The value of these time series is a unix timestamp. I am trying to identify those time series pairs where the timestamps coincide.

Rather than look at lines and see where they coincide, I would like to generate a third TS from the two for each clientid, the timestamp value difference between them. Then my visualization would be more horizontal than a line up and to the right, in most cases centered on zero but in some rarer cases I'd see that a client just switched servers. I'm not having luck figuring this out. It could either be done in promQL or as a grafana transform, I'm not picky.

The resultant TS or data needs to carry the clientid label so I can use that as the legend.

I'm actually using VictoriaMetrics so an MQL solution is great also.

"difference" is great when there are exactly 2 nodeids, let's assume that and solve that first, but what if a client is hitting three nodes?

What Didn't Work

Prometheus/PromQL subtract two gauge metrics because the label values that differentiate the two time series are constants in that example. In my time series I can have arbitrary label pairs.


Solution

  • I used the time() function to subtract the time of sample. this doesn't center on 0 because the time that the metric is recorded isn't the same as the sample time, however it does produce a graph where perfectly dual simultaneous connections appear as 2 coincident horizontal lines at whatever offset from connection time to sample time. offsets within a pair are now also easily visible as the y scale of the graph is now limited to the sample period, not to the range of the query.