I’m looking for a way to monitor in graphana replication among the nodes of a PostgreSQL cluster.
I have been reviewing and my version has the following options available:
pg_stat_replication_reply_time pg_stat_replication_pg_wal_lsn_diff pg_stat_replication_pg_current_wal_lsn_bytes
Currently I'm working with the pg_stat_replication_reply_time metric which according to what I have researched measures the time it takes to write from the primary node to the standby.
It's the first time that I work with these monitoring tools, can you advise me about the metric I'm using is correct?
I personally prefer using either pg_stat_replication_pg_wal_lsn_diff
or pg_replication_slots_pg_wal_lsn_diff
gauges. The second is better (especially for alerts) but it only works if you're using replication slots. Both show how far (in bytes) a replica is behind the master. Since it's a gauge, you don't need any functions to use it, just put the name into the query box
pg_stat_replication_pg_wal_lsn_diff
# or, if you have variables
pg_stat_replication_pg_wal_lsn_diff{instance="$instance"}
As for the other metrics you mentioned, here's what I know:
pg_stat_replication_pg_current_wal_lsn_bytes
- metric description says it's "WAL position in bytes". I guess if you subtract the value of a master server from the value of a replica, the result should be identical to pg_stat_replication_pg_wal_lsn_diff
.
pg_stat_replication_reply_time
- metric description says it's "Unknown metric from pg_stat_replication", but actually it is mentioned in PostgreSQL docs:
reply_time
timestamp with time zoneSend time of last reply message received from standby server
What does that "reply message" mean (in which situation it is sent) I don't know, so I stick to using bytes instead. However if you wish to use it then since it's a timestamp, you need a query like this:
time() - pg_stat_replication_reply_time
You can also apply rate()
instead and if everything's OK the result should be a straight line with the value close to 1 (because rate()
calculates a per-second increase of a counter). In my experience this thing does not work properly with WAL-consumers other than PostgreSQL (e.g. Debezium), while pg_stat_replication_pg_wal_lsn_diff
remains to work as intended.