database postgresql archive database-backups

Backup PostgreSQL database hosted on AWS EC2 without shutting down or restarting the master

I'm using PostgreSQL v9.1 for my organization. The database is hosted on Amazon Web Services (an EC2 instance) behind a Django web application which reads and writes data. The problem is how to back up this database periodically in a specified format (see Requirements).

Requirements:

A standby server is available for backup purposes.
The master-db is to be backed up every hour. On the hour, the db is quickly backed up in its entirety and then copied to the slave as a file-system archive.
Along with hourly backups, I need to perform a daily backup of the database at midnight and a weekly backup at midnight every Sunday.
Weekly backups will be the final backups of the db. All weekly backups will be kept. Only the daily backups from the last week and the hourly backups from the last day will be kept.

But I have the following constraints too.

Live data comes into the server every day (roughly one insert every 2 seconds).
The database now hosts critical customer data, which means it cannot be turned off.
Usually, data stops coming into the db at night, but there's a good chance that data will arrive at the master-db on some nights, and I have no way to stop those insertions without losing customer data.
If I use traditional backup mechanisms or software (for example, barman), I have to configure archiving mode in postgresql.conf and authenticate users in pg_hba.conf, which means a server restart to turn it on, which in turn stops the incoming data for some minutes. This is not permitted (see the constraint above).

Is there a clever way to back up the master-db for my needs? Is there a tool which can automate this job for me?

This is a crucial requirement, as data has been arriving in the master-db for the past few days and I need to make sure the master-db is replicated to a standby server at all times.

Solution

Use EBS snapshots

If, and only if, your entire database including pg_xlog, data, pg_clog, etc is on a single EBS volume, you can use EBS snapshots to do what you describe because they are (or claim to be) atomic. You can't do this if you stripe across multiple EBS volumes.

The general idea is:

Take an EBS snapshot via the EBS APIs, using either the command line AWS tools or a scripting interface like the wonderful boto Python library.
Once the snapshot completes, use AWS API commands to create a volume from it and attach the volume to your instance, or preferably to a separate instance, and then mount it.
On the mounted volume you will find a copy of your database as of the moment you took the snapshot, as if your server had crashed at that moment. PostgreSQL is crash-safe, so that's fine (unless you did something really stupid like set fsync=off in postgresql.conf). Copy the entire database structure to your final backup, e.g. archive it to S3 or whatever.
Unmount, detach, and delete the volume created from the snapshot.
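The steps above can be sketched in Python. This is only an illustration, assuming `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`, the successor to the boto library mentioned above); the function names, the description string, and the `/dev/sdf` device name are hypothetical. Mounting the volume and copying the files happen on the instance itself and are not shown.

```python
def snapshot_to_volume(ec2, volume_id, instance_id, availability_zone, device="/dev/sdf"):
    """Snapshot an EBS volume, then attach a new volume built from that snapshot.

    `ec2` is assumed to behave like a boto3 EC2 client.
    Returns (snapshot_id, new_volume_id).
    """
    # Step 1: take the snapshot and wait until it completes.
    snapshot = ec2.create_snapshot(VolumeId=volume_id, Description="PostgreSQL backup")
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # Step 2: create a volume from the snapshot and attach it to an instance.
    # The volume must be created in the same availability zone as that instance.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=availability_zone)
    new_volume_id = volume["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])
    ec2.attach_volume(VolumeId=new_volume_id, InstanceId=instance_id, Device=device)
    return snapshot_id, new_volume_id


def release_volume(ec2, volume_id):
    """Final step: detach and delete the temporary volume (unmount it first)."""
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.delete_volume(VolumeId=volume_id)
```

Passing the client in as an argument keeps the sketch free of credentials and region settings, which depend on your account.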

This is a terribly inefficient way to do what you want, but it will work.

It is vitally important that you regularly test your backups by restoring them to a temporary server and making sure they're accessible and contain the expected information. Automate this, then check manually anyway.

Can't use EBS snapshots?

If your volume is mapped via LVM, you can do the same thing at the LVM level in your Linux system. This works for the lvm-on-md-on-striped-ebs configuration. You use lvm snapshots instead of EBS, and can only do it on the main machine, but it's otherwise the same.

You can only do this if your entire DB is on one file system.
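As a rough sketch of the LVM variant, the helper below only builds the commands involved so they can be reviewed before anything runs. The volume path `/dev/vg0/pgdata`, the snapshot name, the mount point, and the 5G snapshot size are all hypothetical, and the right mount options depend on your file system.

```python
def lvm_snapshot_commands(origin_lv, snapshot_name, mount_point, snapshot_size="5G"):
    """Return the commands to run before and after copying the database files.

    origin_lv is the logical volume holding the database, e.g. /dev/vg0/pgdata.
    """
    volume_group_dir = origin_lv.rsplit("/", 1)[0]
    snapshot_lv = f"{volume_group_dir}/{snapshot_name}"
    return {
        "before_copy": [
            # Create a point-in-time snapshot of the origin volume.
            ["lvcreate", "--snapshot", "--size", snapshot_size,
             "--name", snapshot_name, origin_lv],
            # Mount the snapshot read-only so the copy cannot modify it.
            ["mount", "-o", "ro", snapshot_lv, mount_point],
        ],
        "after_copy": [
            ["umount", mount_point],
            # Remove the snapshot; it costs write performance while it exists.
            ["lvremove", "-f", snapshot_lv],
        ],
    }
```

Each inner list can be handed to `subprocess.run(cmd, check=True)` as root once you are happy with it.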

No LVM, can't use EBS?

You're going to have to restart the database. You do not need to restart it to change pg_hba.conf; a simple reload (pg_ctl reload, or SIGHUP the postmaster) is sufficient. But you do indeed have to restart to change the archive mode.
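One way to check which settings need a restart is the `context` column of the `pg_settings` view: `postmaster` means restart-only, `sighup` means a reload is enough. The sketch below assumes you run the query yourself (for example with psql) and pass the rows in; the helper name and the choice of settings are illustrative, not an exhaustive list.

```python
# Query to run against the server; the list of names is just an example.
SETTINGS_QUERY = """
SELECT name, context
FROM pg_settings
WHERE name IN ('archive_mode', 'archive_command', 'wal_level')
"""


def split_by_restart(rows):
    """Split (name, context) rows into restart-only and reload-only settings."""
    needs_restart = sorted(name for name, context in rows if context == "postmaster")
    reload_is_enough = sorted(name for name, context in rows if context == "sighup")
    return needs_restart, reload_is_enough
```

On a 9.1 server I would expect `archive_mode` and `wal_level` to land in the first list and `archive_command` in the second, but check the output on your own server rather than trusting that.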

This is one of the many reasons why backups are not an optional extra; they're part of the setup you should be doing before you go live.

If you don't change the archive mode, you can't use PITR, pg_basebackup, WAL archiving, pgbarman, etc. You can use database dumps, and only database dumps.
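If dumps are all you have for now, a small helper can at least give each one a timestamped name. This is a minimal sketch: the database name, backup directory, and file naming scheme are hypothetical, and it only builds the pg_dump command line rather than running it.

```python
from datetime import datetime


def pg_dump_command(dbname, backup_dir, now, host=None, user=None):
    """Build a pg_dump command that writes a custom-format dump named by timestamp."""
    filename = f"{dbname}-{now:%Y%m%dT%H%M%S}.dump"
    cmd = ["pg_dump", "--format=custom", f"--file={backup_dir}/{filename}"]
    if host:
        cmd.append(f"--host={host}")
    if user:
        cmd.append(f"--username={user}")
    cmd.append(dbname)  # the database name goes last
    return cmd
```

A cron job could then call `subprocess.run(pg_dump_command("appdb", "/backups", datetime.now()), check=True)`. pg_dump takes a consistent snapshot without blocking writers, but it does add load while it runs.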

So you've got to find a time to restart. Sorry. If your client applications aren't entirely stupid (i.e. they can handle waiting on a blocked tcp/ip connection), here's how I'd try to do it after doing lots of testing on a replica of my production setup:

Set up a PgBouncer instance
Start directing new connections to PgBouncer instead of the main server
Once all connections are via pgbouncer, change postgresql.conf to set the desired archive mode. Make any other desired restart-only changes at the same time, see the configuration documentation for restart-only parameters.
Wait until there are no active connections
SIGSTOP pgbouncer, so it doesn't respond to new connection attempts
Check again and make sure nobody made a connection in the interim. If they did, SIGCONT pgbouncer, wait for it to finish, and repeat.
Restart PostgreSQL
Make sure I can connect manually with psql
SIGCONT pgbouncer
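The wait, freeze, re-check, and restart part of that procedure could be scripted roughly as follows. Everything here is an assumption layered on the steps above: `count_active`, `restart_postgres`, and `can_connect` are hypothetical callables you would supply (for instance a query against pg_stat_activity, a call to your init script, and a psql connection test), and the same caveat about SIGSTOP applies, so try it on a replica first.

```python
import os
import signal
import time


def restart_behind_pgbouncer(pgbouncer_pid, count_active, restart_postgres,
                             can_connect, send_signal=os.kill, poll_seconds=1.0):
    """Restart PostgreSQL while PgBouncer is frozen with SIGSTOP."""
    while True:
        # Wait until there are no active connections.
        while count_active() > 0:
            time.sleep(poll_seconds)
        # Freeze PgBouncer so it does not respond to new connection attempts.
        send_signal(pgbouncer_pid, signal.SIGSTOP)
        # Check again in case somebody connected in the interim.
        if count_active() == 0:
            break
        # Somebody did: let PgBouncer run again and start over.
        send_signal(pgbouncer_pid, signal.SIGCONT)

    restart_postgres()
    if not can_connect():
        # PgBouncer is deliberately left frozen here so a human can decide what to do.
        raise RuntimeError("PostgreSQL did not accept a connection after the restart")
    send_signal(pgbouncer_pid, signal.SIGCONT)
```

The `send_signal` parameter exists only so the logic can be exercised without a real PgBouncer process.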

I'd rather explicitly set pgbouncer to a "hold all connections" mode, but I'm not sure it has one, and don't have time to look into it right now. I'm not at all certain that SIGSTOPing pgbouncer will achieve the desired effect, either; you must experiment on a replica of your production setup to ensure that this is the case.

Once you've restarted

Use WAL archiving and PITR, plus periodic pg_dump backups for extra assurance.
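To make WAL archiving concrete: the server runs whatever `archive_command` you configure once per completed WAL segment, and treats a zero exit status as success. Below is a minimal sketch of such a command in Python. The script path and archive directory in the comment are hypothetical, and a production version would also want to fsync the copy and ship it off the machine.

```python
import os
import shutil
import sys

# Assumed postgresql.conf entries on 9.1 (paths are examples only):
#   wal_level = archive
#   archive_mode = on
#   archive_command = 'python /usr/local/bin/archive_wal.py %p %f /mnt/wal_archive'


def archive_wal(wal_path, wal_name, archive_dir):
    """Copy one WAL segment into archive_dir; return 0 on success, 1 on refusal."""
    target = os.path.join(archive_dir, wal_name)
    if os.path.exists(target):
        # Never overwrite an archived segment; a non-zero status makes
        # the server keep the segment and retry later.
        return 1
    partial = target + ".tmp"
    shutil.copyfile(wal_path, partial)
    # Rename last so a half-written file never carries the final name.
    os.rename(partial, target)
    return 0


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(archive_wal(sys.argv[1], sys.argv[2], sys.argv[3]))
```

Refusing to overwrite follows the advice in the PostgreSQL manual's own example archive command.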

See the backup chapter of the user manual, which explains your options in detail. Pay particular attention to the "SQL Dump" and "Continuous Archiving and Point-in-Time Recovery (PITR)" sections.

PgBarman automates the PITR option for you, including scheduling, and supports hooks for storing WAL and base backups in S3 instead of local storage. Alternatively, WAL-E is a bit less automated, but integrates with S3 out of the box. You can implement your retention policies with S3, or via barman.

(Remember that you can use retention policies in S3 to shove old backups into Glacier, too).
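If you end up pruning backups with your own script instead, the retention rule from the question (keep every weekly backup, daily backups for a week, hourly backups for a day) is small enough to write down directly. This is a sketch only: the `kind` labels are a hypothetical naming scheme, and how you list and delete the backups is up to your storage.

```python
from datetime import datetime, timedelta


def should_keep(kind, taken_at, now):
    """Decide whether a backup of the given kind and age is still retained."""
    age = now - taken_at
    if kind == "weekly":
        return True                      # weekly backups are kept forever
    if kind == "daily":
        return age <= timedelta(days=7)  # dailies for the last week
    if kind == "hourly":
        return age <= timedelta(days=1)  # hourlies for the last day
    raise ValueError(f"unknown backup kind: {kind!r}")
```

Anything for which this returns False is a candidate for deletion or for moving to colder storage.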

Reducing future pain

Outages happen.

Outages of single-machine setups on something as unreliable as Amazon EC2 happen a lot.

You must get failover and replication in place. This means that you must restart the server. If you do not do this, you will eventually have a major outage, and it will happen at the worst possible time. Get your HA setup sorted out now, not later; it's only going to get harder.

You should also ensure that your client applications can buffer writes without losing them. Relying on a remote database on an Internet host to be available all the time is stupid, and again, it will bite you unless you fix it.
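As an illustration of what buffering writes can look like on the client side, here is a minimal in-memory sketch. The class name and the `send` callable (whatever performs the actual insert) are hypothetical, and it assumes connection failures surface as `ConnectionError`. A real client would spool to disk so that buffered records survive a crash of the client itself.

```python
from collections import deque


class WriteBuffer:
    """Queue records locally and send them in order once the database is reachable."""

    def __init__(self, send):
        self._send = send        # callable that writes one record to the database
        self._pending = deque()  # records not yet confirmed as written

    def write(self, record):
        """Accept a record and try to deliver everything that is pending."""
        self._pending.append(record)
        self.flush()

    def flush(self):
        """Send pending records oldest first; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False     # keep the record and retry on the next flush
            self._pending.popleft()
        return True

    def __len__(self):
        return len(self._pending)
```

A record is removed from the queue only after `send` returns, so a failure mid-flush can cause a retry but not a silent loss.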