Avoid terminating connection due to conflict with recovery
error and also have acceptable replication lag
.
Google Cloud PostgreSQL 9.6, Replication turned on (uses Streaming replication), PGPool-II set to only do load balancing and with following properties on slave:
work_mem 3276800
commit_delay 100
max_wal_size 940
max_standby_archive_delay -1
max_standby_streaming_delay -1
hot_standby_feedback on
Machine config:
vCPUs:8, Memory: 30 GB, SSD storage: 76 GB
Workload:
Master fully loaded with writes
and reads
, and slave also fully loaded with lots of reads
.
The max length of queries might be around 8
-10
secs.
What we tried before:
Set max_standby_archive_delay
and max_standby_streaming_delay
to 900000
(900 secs), however we were seeing a lot of conflict
errors.
Set max_standby_archive_delay
and max_standby_streaming_delay
to -1
, this made the conflict errors go away, however the lag increased a lot (somewhere around 23mins
)
Set max_standby_archive_delay
and max_standby_streaming_delay
to -1
and hot_standby_feedback
to on
. This also made the conflict errors go away, however we are still seeing replication lags (around 500 secs
)
Query used for lag:
SELECT
pg_last_xlog_receive_location() receive,
pg_last_xlog_replay_location() replay,
(
extract(epoch FROM now()) -
extract(epoch FROM pg_last_xact_replay_timestamp())
)::int lag;
Graph of lag measured every 1 sec over a period of 9 hours
:
Questions:
Thanks!
You cannot totally avoid conflicts — every statement like TRUNCATE
or ALTER TABLE
that requires an ACCESS EXCLUSIVE
lock will lead to a replication conflict.
But you can avoid replication conflicts caused by VACUUM
:
Set hot_standby_feedback = on
to keep PostgreSQL from removing tuples still needed on the standby.
Set old_snapshot_threshold
to a (possibly high) value other than the default to avoid vacuum truncation.
This truncation requires an ACCESS EXCLUSIVE
lock that can also lead to conflicts.
For the remaining conflicts, you have a choice between delayed application and query cancelation. Or you change the workload to avoid ACCESS EXCLUSIVE
locks.
To find out what is blocking you, you'll have to use pg_xlogdump
on the WAL files and search for ACCESS EXCLUSIVE
locks. This will allow you to figure out which object is locked. To find out what kind of operation is performed, check the WAL entries immediately before (VACUUM
?) or immediately afterwards (DDL?).