We're exploring using Redshift for our warehouse and we need to move new data from our existing on-premises postgres db to Redshift nightly. It looks like you can do this with Pipeline and a template if your primary db is in RDS, but can you do it with Pipeline if your database is on-premises?
Since you are syncing in bulk on a periodic basis, you might want to consider pushing your data to S3, which is easy to do from on-premises, and then loading it into Redshift via the COPY command. It's fast and reliable, and you get a backup in S3 for free (cheap) as a side effect.
You can use aws s3 cp from the awscli Python tool to push up your CSV Postgres dumps (after putting your access key in ~/.aws/config) with something like:
aws s3 cp current_dump.csv.gz s3://yourbucket/20170108/dump.csv.gz
Then use the regular Postgres psql utility to execute the COPY on Redshift, like so:
PGPASSWORD='YOURPASS' psql -h your.redshift.end.point.com -U youruser -d yourdb -p 5439 -c "COPY yourtable FROM 's3://yourbucket/20170108/dump.csv.gz' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' DELIMITER ',' NULL 'NULL' IGNOREBLANKLINES EMPTYASNULL BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1 CSV GZIP;"
You could put your Postgres dump and these commands in a shell script, with a bit of scripting to generate the date values, and run it as a cron job.
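Tying the steps together, a minimal nightly script might look like the sketch below. All host names, bucket names, table names, and the REDSHIFT_PASSWORD variable are placeholder assumptions, and the dump step assumes a table you can \copy out as CSV:

```shell
#!/bin/sh
# Hypothetical nightly sync script -- hosts, buckets, tables, and
# credentials below are placeholders, not real values.
set -eu

DATESTAMP=$(date +%Y%m%d)
DUMP=current_dump.csv.gz
S3_PATH="s3://yourbucket/${DATESTAMP}/dump.csv.gz"

# 1. Dump the table from on-premises Postgres as gzipped CSV.
psql -h localhost -d sourcedb \
  -c "\copy yourtable TO STDOUT WITH CSV" | gzip > "$DUMP"

# 2. Push the dump to S3 (credentials come from ~/.aws/config).
aws s3 cp "$DUMP" "$S3_PATH"

# 3. Load it into Redshift via COPY.
PGPASSWORD="$REDSHIFT_PASSWORD" psql -h your.redshift.end.point.com \
  -U youruser -d yourdb -p 5439 \
  -c "COPY yourtable FROM '$S3_PATH' CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]' CSV GZIP;"
```

Run from cron (e.g. `15 2 * * * /path/to/nightly_sync.sh`), the datestamped S3 key gives you one retained snapshot per night.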
As a best practice, load into a staging table first and then merge into your main table, so you can handle updates if needed and prevent duplicate rows.
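The merge step can be sketched like this (table and key names here are assumptions for illustration; Redshift has no native UPSERT, so the usual pattern is a delete-then-insert inside one transaction):

```sql
-- Hypothetical merge: yourtable_staging was just loaded via COPY,
-- and "id" is assumed to be the unique business key.
BEGIN;

-- Remove main-table rows that are being replaced by newer versions.
DELETE FROM yourtable
USING yourtable_staging
WHERE yourtable.id = yourtable_staging.id;

-- Insert all staged rows (both updated and brand-new records).
INSERT INTO yourtable
SELECT * FROM yourtable_staging;

COMMIT;

-- Clear the staging table for the next nightly run.
TRUNCATE yourtable_staging;
```

You would COPY into yourtable_staging instead of yourtable in the nightly job, then run this merge afterwards.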