I am trying to create a hot_standby server, and I receive the following error after pg_basebackup completes. Notice I use a shell script, replicator.sh, to start the replication. Can anyone give me some insight?
My specs:
Error:
postgres@database-master:/etc/postgresql/9.3/main$ sh replicator.sh
Stopping PostgreSQL
[ ok ] Stopping PostgreSQL 9.3 database server: main.
Cleaning up old cluster directory
Starting base backup as replicator
Password:
113720266/113720266 kB (100%), 1/1 tablespace
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: base backup completed
Starting Postgresql
[....] Starting PostgreSQL 9.3 database server: main[....] The PostgreSQL server failed to start.
Please check the log output: 2014-09-11 17:56:33 UTC LOG: database system was interrupted; last
known up at 2014-09-11 16:54:29 UTC 2014-09-11 17:56:33 UTC LOG: creating missing WAL directory
"pg_xlog/archive_status" 2014-09-11 17:56:33 UTC LOG: incomplete startup packet 2014-09-11 17:56:33
UTC LOG: invalid checkpoint record 2014-09-11 17:56:33 UTC FATAL: could not locate required
checkpoint record 2014-09-11 17:56:33 UTC HINT: If you are not restoring from a backup, try
removing the file "/var/lib/p[FAILesql/9.3/main/backup_label". 2014-09-11 17:56:33 UTC LOG: startup
process (PID 21972) exited with exit code 1 2014-09-11 17:56:33 UTC LOG: aborting startup due to
startup process failure ... failed! failed!
Contents of replicator.sh:
#!/bin/bash
echo Stopping PostgreSQL
/etc/init.d/postgresql stop
echo Cleaning up old cluster directory
rm -rf /var/lib/postgresql/9.3/main
echo Starting base backup as replicator
pg_basebackup -h 123.456.789.123 -D /var/lib/postgresql/9.3/main -U replicator -v -P
echo Writing recovery.conf file
sudo -u postgres bash -c "cat > /var/lib/postgresql/9.3/main/recovery.conf <<- _EOF1_
standby_mode = 'on'
primary_conninfo = 'host=123.456.789.123 port=5432 user=replicator password=XXXXX sslmode=require'
trigger_file = '/tmp/postgresql.trigger'
_EOF1_
"
echo Starting Postgresql
/etc/init.d/postgresql start
Thank you, Jake
My best guess from the above is that the pg_basebackup
failed and your shell script doesn't check for error return codes or use set -e
to automatically abort after errors, so it just carried on regardless.
It's also possible that you don't have WAL archiving configured, or don't have a restore_command
set in the replica. In that case, the transaction logs required to start the base backup will not be available and startup will fail.
I strongly recommend that you:
Use pg_basebackup -X stream
so that the required transaction logs get copied along with the backup; and
Use set -e
in your shell script, or test for errors with a suitable if ! pg_basebackup .... ; then
block.