Search code examples
postgresqlprometheusgrafana

How to measure the replication in PostgreSQL with Grafana


I’m looking for a way to monitor in graphana replication among the nodes of a PostgreSQL cluster.

I have been reviewing and my version has the following options available:

pg_stat_replication_reply_time pg_stat_replication_pg_wal_lsn_diff pg_stat_replication_pg_current_wal_lsn_bytes

Currently I'm working with the pg_stat_replication_reply_time metric which according to what I have researched measures the time it takes to write from the primary node to the standby.

enter image description here

It's the first time that I work with these monitoring tools, can you advise me about the metric I'm using is correct?


Solution

  • I personally prefer using either pg_stat_replication_pg_wal_lsn_diff or pg_replication_slots_pg_wal_lsn_diff gauges. The second is better (especially for alerts) but it only works if you're using replication slots. Both show how far (in bytes) a replica is behind the master. Since it's a gauge, you don't need any functions to use it, just put the name into the query box

    pg_stat_replication_pg_wal_lsn_diff
    # or, if you have variables
    pg_stat_replication_pg_wal_lsn_diff{instance="$instance"}
    

    As for the other metrics you mentioned, here's what I know:

    • pg_stat_replication_pg_current_wal_lsn_bytes - metric description says it's "WAL position in bytes". I guess if you subtract the value of a master server from the value of a replica, the result should be identical to pg_stat_replication_pg_wal_lsn_diff.

    • pg_stat_replication_reply_time - metric description says it's "Unknown metric from pg_stat_replication", but actually it is mentioned in PostgreSQL docs:

      reply_time timestamp with time zone

      Send time of last reply message received from standby server

      What does that "reply message" mean (in which situation it is sent) I don't know, so I stick to using bytes instead. However if you wish to use it then since it's a timestamp, you need a query like this:

      time() - pg_stat_replication_reply_time
      

      You can also apply rate() instead and if everything's OK the result should be a straight line with the value close to 1 (because rate() calculates a per-second increase of a counter). In my experience this thing does not work properly with WAL-consumers other than PostgreSQL (e.g. Debezium), while pg_stat_replication_pg_wal_lsn_diff remains to work as intended.