We are planning for a large Greenplum DB (growing from 10 to 100TB over the first 18 months). Traditional backup and restore tools aren't going to help as we have 24hr RPO/RTOs to deal with.
Is there a way to replicate the DB across to our DR site without resorting to block replication (i.e. place a segment on SAN and mirror)?
You have a few options to choose from:
- Dual ETL. Replicate the input data and run the same ETL on both sites. Synchronize the two sites with a backup-restore every week or so.
- Backup-restore. A plain backup-restore can be inefficient. But if you use DataDomain, it can deduplicate at the block level and store only the changed blocks. With DDBoost, the deduplication work is offloaded to run on the Greenplum cluster. When replicating to a remote site, it also transfers only the changed blocks, which greatly reduces replication time. In my experience, if a clean backup to DD takes 12 hours, a subsequent DDBoost backup takes about 4 hours, plus another 4 hours to replicate the data.
- Custom solution. I know of a case where data replication to the remote site is done as part of the ETL process. The ETL job knows which tables it has changed; those tables are added to a replication queue and moved to the remote site using external tables. Analysts work in a dedicated sandbox, and that sandbox is replicated with a daily backup-restore.
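To make the custom option a bit more concrete, here is a minimal Python sketch of the replication-queue idea. It is not the implementation from the case I mentioned. The class and function names, the `_repl_out` suffix, and the `gpfdist://dr-etl-host:8081` address are all made up for illustration, and the generated SQL follows Greenplum's writable external table syntax as I understand it, so check it against your version before relying on it.

```python
from collections import OrderedDict


class ReplicationQueue:
    """Ordered, de-duplicated queue of tables changed by ETL jobs (hypothetical helper)."""

    def __init__(self):
        self._pending = OrderedDict()

    def mark_changed(self, table):
        # A table changed several times before the next run is still shipped once.
        self._pending.setdefault(table, None)

    def drain(self):
        # Return the pending tables in first-changed order and empty the queue.
        tables = list(self._pending)
        self._pending.clear()
        return tables


def export_statements(table, gpfdist_url):
    """Build SQL that unloads one table through a writable external table.

    Assumption: a gpfdist process is listening at gpfdist_url on a host
    from which the files can be shipped to the remote site.
    """
    ext = table.replace(".", "_") + "_repl_out"
    return [
        f"CREATE WRITABLE EXTERNAL TABLE {ext} (LIKE {table}) "
        f"LOCATION ('{gpfdist_url}/{table}.txt') FORMAT 'TEXT';",
        f"INSERT INTO {ext} SELECT * FROM {table};",
        f"DROP EXTERNAL TABLE {ext};",
    ]


if __name__ == "__main__":
    queue = ReplicationQueue()
    for changed in ["sales.orders", "sales.items", "sales.orders"]:
        queue.mark_changed(changed)
    for table in queue.drain():
        for stmt in export_statements(table, "gpfdist://dr-etl-host:8081"):
            print(stmt)
```

The ETL job would call `mark_changed` at the end of each load, and a separate step would drain the queue and run the statements. The remote site then loads the files through a matching readable external table.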
At the moment Greenplum does not have a built-in WAN replication solution, so these are pretty much all the options to choose from.
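As a rough sanity check of the backup-restore timings above against a 24-hour target, the arithmetic can be written down as a tiny Python helper. The function names are hypothetical, and the restore time on the DR site is not something I gave a figure for, so it is left as a parameter you would have to measure yourself.

```python
def cycle_hours(backup_h, replicate_h, restore_h=0.0):
    """Total hours for one backup, replicate and (optionally) restore cycle."""
    return backup_h + replicate_h + restore_h


def fits_target(backup_h, replicate_h, target_h, restore_h=0.0):
    """True if the whole cycle completes within the target window."""
    return cycle_hours(backup_h, replicate_h, restore_h) <= target_h


if __name__ == "__main__":
    # Figures from the DDBoost example: 4 hours backup + 4 hours replication.
    print(cycle_hours(4, 4))       # 8
    print(fits_target(4, 4, 24))   # True, before adding any restore time
```

With those numbers the incremental cycle takes 8 hours, which leaves 16 hours of the window for the restore on the remote site and for data growth.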