Search code examples
postgresqlreplicationdatabase-replicationpgpool

Manage conflicts and lag on Postgres Replication in Hot Standby with read heavy Slave


Requirement:

Avoid terminating connection due to conflict with recovery error and also have acceptable replication lag.

Google Cloud PostgreSQL 9.6, Replication turned on (uses Streaming replication), PGPool-II set to only do load balancing and with following properties on slave:

work_mem    3276800
commit_delay    100
max_wal_size    940
max_standby_archive_delay   -1
max_standby_streaming_delay -1
hot_standby_feedback    on

Machine config:

vCPUs:8, Memory: 30 GB, SSD storage: 76 GB

Workload:

Master fully loaded with writes and reads, and slave also fully loaded with lots of reads. The max length of queries might be around 8-10 secs.

What we tried before:

  • Set max_standby_archive_delay and max_standby_streaming_delay to 900000 (900 secs), however we were seeing a lot of conflict errors.

  • Set max_standby_archive_delay and max_standby_streaming_delay to -1, this made the conflict errors go away, however the lag increased a lot (somewhere around 23mins)

  • Set max_standby_archive_delay and max_standby_streaming_delay to -1 and hot_standby_feedback to on. This also made the conflict errors go away, however we are still seeing replication lags (around 500 secs)

Query used for lag:

SELECT
  pg_last_xlog_receive_location() receive,
  pg_last_xlog_replay_location() replay,
  (
   extract(epoch FROM now()) -
   extract(epoch FROM pg_last_xact_replay_timestamp())
  )::int lag;

Graph of lag measured every 1 sec over a period of 9 hours:

graph

Questions:

  1. Given our use-case (Slave being actively used for read queries, how do we make sure we have no conflict errors and a reasonable lag (around few secs)
  2. What does the lag mean? Does it mean only one of the table is behind Master? Or does it mean all other WALs are also pending to be applied on Slave.
  3. If 1. is not achievable using config properties, how do we solve it in code (This is the least desirable since the code base is vast and will require lots of changes)

Thanks!


Solution

  • You cannot totally avoid conflicts — every statement like TRUNCATE or ALTER TABLE that requires an ACCESS EXCLUSIVE lock will lead to a replication conflict.

    But you can avoid replication conflicts caused by VACUUM:

    • Set hot_standby_feedback = on to keep PostgreSQL from removing tuples still needed on the standby.

    • Set old_snapshot_threshold to a (possibly high) value other than the default to avoid vacuum truncation.

      This truncation requires an ACCESS EXCLUSIVE lock that can also lead to conflicts.

    For the remaining conflicts, you have a choice between delayed application and query cancelation. Or you change the workload to avoid ACCESS EXCLUSIVE locks.

    To find out what is blocking you, you'll have to use pg_xlogdump on the WAL files and search for ACCESS EXCLUSIVE locks. This will allow you to figure out which object is locked. To find out what kind of operation is performed, check the WAL entries immediately before (VACUUM?) or immediately afterwards (DDL?).